* [PATCH v4 0/3] Preventing job distribution to isolated CPUs
@ 2020-06-25 22:34 Nitesh Narayan Lal
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-06-25 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, tglx, davem, akpm,
	sfr, stephen, rppt, jinyuqi, zhangshaokun

This patch set originates from one of the patches posted earlier as
part of the "Task_isolation" mode [1] patch series by Alex Belits
<abelits@marvell.com>. Compared to what Alex posted earlier, I am
proposing only a couple of changes in this patch set.
 
 
Context
=======
On a broad level, all three patches in this patch set make the
drivers/libraries respect isolated CPUs by not pinning any job on
them. Not doing so could add a latency overhead in RT use cases.


Patches
=======
* Patch1:
  The first patch makes cpumask_local_spread() aware of the
  isolated CPUs. It ensures that the CPUs returned by this API
  include only housekeeping CPUs.

* Patch2:
  This patch ensures that a probe function that is called
  using work_on_cpu() doesn't run any task on an isolated CPU.

* Patch3:
  This patch makes store_rps_map() aware of the isolated
  CPUs so that RPS doesn't queue any jobs on an isolated CPU.


Proposed Changes
================
To fix the above-mentioned issues, Alex used housekeeping_cpumask().
The only changes that I am proposing here are:
- Removing the dependency on CONFIG_TASK_ISOLATION that Alex proposed,
  as it should be safe to rely on housekeeping_cpumask() even when we
  don't have any isolated CPUs and want to fall back to using all
  available CPUs in any of the above scenarios.
- Using both HK_FLAG_DOMAIN and HK_FLAG_WQ in Patch 2 & 3, because we
  want the above fixes not only with isolcpus but also with something
  like systemd's CPU affinity. A sketch of the resulting pattern
  follows this list.


Testing
=======
* Patch 1:
  The fix for cpumask_local_spread() was tested by creating VFs,
  loading the iavf module, and adding a tracepoint to confirm that
  only housekeeping CPUs are picked when an appropriate profile is
  set up, and that all remaining CPUs are picked when no CPU
  isolation is configured.

* Patch 2:
  To test the PCI fix, I hotplugged a virtio-net-pci device from the
  qemu console and forced its addition to a specific node to trigger
  the code path that includes the proposed fix, then verified via a
  tracepoint that only housekeeping CPUs are included.

* Patch 3:
  To test the fix in store_rps_map(), I tried configuring an isolated
  CPU by writing its mask to /sys/class/net/en*/queues/rx*/rps_cpus,
  which failed with 'write error: Invalid argument'. Writing a
  non-isolated CPU to rps_cpus succeeded without any error. A
  user-space sketch of this check follows.
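
A minimal user-space version of that check (hypothetical device/queue
names and mask values; the actual test was done by hand from a shell):

/*
 * Sketch: writing a mask that contains only an isolated CPU to
 * rps_cpus should now fail with EINVAL, while a mask containing a
 * housekeeping CPU should still be accepted.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Example path and mask: assumes eth0/rx-0 exists and CPU 1 is
	 * isolated on this system.
	 */
	const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
	const char *mask = "2";
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, mask, strlen(mask)) < 0 && errno == EINVAL)
		printf("isolated CPU correctly rejected\n");
	close(fd);
	return 0;
}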


Changes from v3[2]:
===================
- In patch 1, replaced HK_FLAG_WQ with HK_FLAG_MANAGED_IRQ based on the
  suggestion from Frederic Weisbecker.

Changes from v2[3]:
===================
Both the following suggestions are from Peter Zijlstra.
- Patch1: Removed the extra while loop from cpumask_local_spread() and
  fixed the code styling issues.
- Patch3: Changed to use cpumask_empty() for verifying that the
  requested CPUs are available in the housekeeping cpumask.

Changes from v1[4]:
===================
- Included the suggestions made by Bjorn Helgaas in the commit message.
- Included the 'Reviewed-by' and 'Acked-by' received for Patch-2.


[1] https://patchwork.ozlabs.org/project/netdev/patch/51102eebe62336c6a4e584c7a503553b9f90e01c.camel@marvell.com/
[2] https://patchwork.ozlabs.org/project/linux-pci/cover/20200623192331.215557-1-nitesh@redhat.com/
[3] https://patchwork.ozlabs.org/project/linux-pci/cover/20200622234510.240834-1-nitesh@redhat.com/
[4] https://patchwork.ozlabs.org/project/linux-pci/cover/20200610161226.424337-1-nitesh@redhat.com/


Alex Belits (3):
  lib: Restrict cpumask_local_spread to housekeeping CPUs
  PCI: Restrict probe functions to housekeeping CPUs
  net: Restrict receive packets queuing to housekeeping CPUs

 drivers/pci/pci-driver.c |  5 ++++-
 lib/cpumask.c            | 16 +++++++++++-----
 net/core/net-sysfs.c     | 10 +++++++++-
 3 files changed, 24 insertions(+), 7 deletions(-)

-- 



* [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-06-25 22:34 [PATCH v4 0/3] Preventing job distribution to isolated CPUs Nitesh Narayan Lal
@ 2020-06-25 22:34 ` Nitesh Narayan Lal
  2020-06-29 16:11   ` Nitesh Narayan Lal
                     ` (2 more replies)
  2020-06-25 22:34 ` [Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs Nitesh Narayan Lal
  2020-06-25 22:34 ` [Patch v4 3/3] net: Restrict receive packets queuing " Nitesh Narayan Lal
  2 siblings, 3 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-06-25 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, tglx, davem, akpm,
	sfr, stephen, rppt, jinyuqi, zhangshaokun

From: Alex Belits <abelits@marvell.com>

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time
task, it will still be returned to the caller for pinning of its IRQ
threads. Having these unwanted IRQ threads on an isolated CPU adds a
latency overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 lib/cpumask.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	int cpu, hk_flags;
+	const struct cpumask *mask;
 
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+	mask = housekeeping_cpumask(hk_flags);
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
+	i %= cpumask_weight(mask);
 
 	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
+		for_each_cpu(cpu, mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 	} else {
 		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 
-		for_each_cpu(cpu, cpu_online_mask) {
+		for_each_cpu(cpu, mask) {
 			/* Skip NUMA nodes, done above. */
 			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
 				continue;
-- 
2.18.4



* [Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs
  2020-06-25 22:34 [PATCH v4 0/3] Preventing job distribution to isolated CPUs Nitesh Narayan Lal
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
@ 2020-06-25 22:34 ` Nitesh Narayan Lal
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
  2020-06-25 22:34 ` [Patch v4 3/3] net: Restrict receive packets queuing " Nitesh Narayan Lal
  2 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-06-25 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, tglx, davem, akpm,
	sfr, stephen, rppt, jinyuqi, zhangshaokun

From: Alex Belits <abelits@marvell.com>

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Change the cpumask used in pci_call_probe() from all online CPUs to
housekeeping CPUs only. This ensures that no additional latency
overhead is caused by the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 drivers/pci/pci-driver.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510af1221..449466f71040 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
 	int error, node, cpu;
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-- 
2.18.4



* [Patch v4 3/3] net: Restrict receive packets queuing to housekeeping CPUs
  2020-06-25 22:34 [PATCH v4 0/3] Preventing job distribution to isolated CPUs Nitesh Narayan Lal
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
  2020-06-25 22:34 ` [Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs Nitesh Narayan Lal
@ 2020-06-25 22:34 ` Nitesh Narayan Lal
  2020-06-26 11:14   ` Peter Zijlstra
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
  2 siblings, 2 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-06-25 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, tglx, davem, akpm,
	sfr, stephen, rppt, jinyuqi, zhangshaokun

From: Alex Belits <abelits@marvell.com>

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 net/core/net-sysfs.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b822bb15..677868fea316 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
 	struct rps_map *old_map, *map;
 	cpumask_var_t mask;
-	int err, cpu, i;
+	int err, cpu, i, hk_flags;
 	static DEFINE_MUTEX(rps_map_mutex);
 
 	if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+	cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+	if (cpumask_empty(mask)) {
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);
-- 
2.18.4



* Re: [Patch v4 3/3] net: Restrict receive packets queuing to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 3/3] net: Restrict receive packets queuing " Nitesh Narayan Lal
@ 2020-06-26 11:14   ` Peter Zijlstra
  2020-06-26 17:20     ` David Miller
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Zijlstra @ 2020-06-26 11:14 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, tglx, davem, akpm, sfr,
	stephen, rppt, jinyuqi, zhangshaokun

On Thu, Jun 25, 2020 at 06:34:43PM -0400, Nitesh Narayan Lal wrote:
> From: Alex Belits <abelits@marvell.com>
> 
> With the existing implementation of store_rps_map(), packets are queued
> in the receive path on the backlog queues of other CPUs irrespective of
> whether they are isolated or not. This could add a latency overhead to
> any RT workload that is running on the same CPU.
> 
> Ensure that store_rps_map() only uses available housekeeping CPUs for
> storing the rps_map.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Dave, ACK if I route this?


* Re: [Patch v4 3/3] net: Restrict receive packets queuing to housekeeping CPUs
  2020-06-26 11:14   ` Peter Zijlstra
@ 2020-06-26 17:20     ` David Miller
  0 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2020-06-26 17:20 UTC (permalink / raw)
  To: peterz
  Cc: nitesh, linux-kernel, linux-api, frederic, mtosatti, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, tglx, akpm, sfr,
	stephen, rppt, jinyuqi, zhangshaokun

From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 26 Jun 2020 13:14:01 +0200

> On Thu, Jun 25, 2020 at 06:34:43PM -0400, Nitesh Narayan Lal wrote:
>> From: Alex Belits <abelits@marvell.com>
>> 
>> With the existing implementation of store_rps_map(), packets are queued
>> in the receive path on the backlog queues of other CPUs irrespective of
>> whether they are isolated or not. This could add a latency overhead to
>> any RT workload that is running on the same CPU.
>> 
>> Ensure that store_rps_map() only uses available housekeeping CPUs for
>> storing the rps_map.
>> 
>> Signed-off-by: Alex Belits <abelits@marvell.com>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> 
> Dave, ACK if I route this?

No problem:

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
@ 2020-06-29 16:11   ` Nitesh Narayan Lal
  2020-07-01  0:32     ` Andrew Morton
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
  2021-01-27 11:57   ` [Patch v4 1/3] " Robin Murphy
  2 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-06-29 16:11 UTC (permalink / raw)
  To: linux-kernel, linux-api, peterz
  Cc: frederic, mtosatti, juri.lelli, abelits, bhelgaas, linux-pci,
	rostedt, mingo, tglx, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun




On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
> From: Alex Belits <abelits@marvell.com>
>
> The current implementation of cpumask_local_spread() does not respect
> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
> task, it will still be returned to the caller for pinning of its IRQ
> threads. Having these unwanted IRQ threads on an isolated CPU adds a
> latency overhead.
>
> Restrict the CPUs that are returned for spreading IRQs only to the
> available housekeeping CPUs.
>
> Signed-off-by: Alex Belits <abelits@marvell.com>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Hi Peter,

I just realized that Yuqi Jin's patch [1], which modifies
cpumask_local_spread(), is sitting in linux-next.
Should I re-post after rebasing the patches on top of linux-next?

[1]
https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshaokun@hisilicon.com/

> ---
>  lib/cpumask.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index fb22fb266f93..85da6ab4fbb5 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
>  #include <linux/export.h>
>  #include <linux/memblock.h>
>  #include <linux/numa.h>
> +#include <linux/sched/isolation.h>
>  
>  /**
>   * cpumask_next - get the next cpu in a cpumask
> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>   */
>  unsigned int cpumask_local_spread(unsigned int i, int node)
>  {
> -	int cpu;
> +	int cpu, hk_flags;
> +	const struct cpumask *mask;
>  
> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> +	mask = housekeeping_cpumask(hk_flags);
>  	/* Wrap: we always want a cpu. */
> -	i %= num_online_cpus();
> +	i %= cpumask_weight(mask);
>  
>  	if (node == NUMA_NO_NODE) {
> -		for_each_cpu(cpu, cpu_online_mask)
> +		for_each_cpu(cpu, mask) {
>  			if (i-- == 0)
>  				return cpu;
> +		}
>  	} else {
>  		/* NUMA first. */
> -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>  			if (i-- == 0)
>  				return cpu;
> +		}
>  
> -		for_each_cpu(cpu, cpu_online_mask) {
> +		for_each_cpu(cpu, mask) {
>  			/* Skip NUMA nodes, done above. */
>  			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>  				continue;
-- 
Nitesh




* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-06-29 16:11   ` Nitesh Narayan Lal
@ 2020-07-01  0:32     ` Andrew Morton
  2020-07-01  0:47       ` Nitesh Narayan Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2020-07-01  0:32 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: linux-kernel, linux-api, peterz, frederic, mtosatti, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, tglx, davem, sfr,
	stephen, rppt, jinyuqi, zhangshaokun

On Mon, 29 Jun 2020 12:11:25 -0400 Nitesh Narayan Lal <nitesh@redhat.com> wrote:

> 
> On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
> > From: Alex Belits <abelits@marvell.com>
> >
> > The current implementation of cpumask_local_spread() does not respect
> > isolated CPUs, i.e., even if a CPU has been isolated for a real-time
> > task, it will still be returned to the caller for pinning of its IRQ
> > threads. Having these unwanted IRQ threads on an isolated CPU adds a
> > latency overhead.
> >
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> >
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> 
> Hi Peter,
> 
> I just realized that Yuqi Jin's patch [1], which modifies
> cpumask_local_spread(), is sitting in linux-next.
> Should I re-post after rebasing the patches on top of linux-next?
> 
> [1]
> https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshaokun@hisilicon.com/

This patch has had some review difficulties and has been pending for
quite some time.  I suggest you base your work on mainline and that we
ask Yuqi Jin to rebase on that, as I don't feel confident doing it myself.



* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-07-01  0:32     ` Andrew Morton
@ 2020-07-01  0:47       ` Nitesh Narayan Lal
  0 siblings, 0 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2020-07-01  0:47 UTC (permalink / raw)
  To: Andrew Morton, peterz
  Cc: linux-kernel, linux-api, frederic, mtosatti, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, tglx, davem, sfr, stephen,
	rppt, jinyuqi, zhangshaokun




On 6/30/20 8:32 PM, Andrew Morton wrote:
> On Mon, 29 Jun 2020 12:11:25 -0400 Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>> On 6/25/20 6:34 PM, Nitesh Narayan Lal wrote:
>>> From: Alex Belits <abelits@marvell.com>
>>>
>>> The current implementation of cpumask_local_spread() does not respect
>>> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
>>> task, it will still be returned to the caller for pinning of its IRQ
>>> threads. Having these unwanted IRQ threads on an isolated CPU adds a
>>> latency overhead.
>>>
>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>> available housekeeping CPUs.
>>>
>>> Signed-off-by: Alex Belits <abelits@marvell.com>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> Hi Peter,
>>
>> I just realized that Yuqi Jin's patch [1], which modifies
>> cpumask_local_spread(), is sitting in linux-next.
>> Should I re-post after rebasing the patches on top of linux-next?
>>
>> [1]
>> https://lore.kernel.org/lkml/1582768688-2314-1-git-send-email-zhangshaokun@hisilicon.com/
> This patch has had some review difficulties and has been pending for
> quite some time.  I suggest you base your work on mainline and that we
> ask Yuqi Jin to rebase on that, as I don't feel confident doing it myself.
>

I see. In that case, it should be fine to go ahead with this patch set,
as I had already based it on top of the latest master before posting.

-- 
Thanks
Nitesh




* [tip: sched/core] net: Restrict receive packets queuing to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 3/3] net: Restrict receive packets queuing " Nitesh Narayan Lal
  2020-06-26 11:14   ` Peter Zijlstra
@ 2020-07-09  8:45   ` tip-bot2 for Alex Belits
  1 sibling, 0 replies; 52+ messages in thread
From: tip-bot2 for Alex Belits @ 2020-07-09  8:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alex Belits, Nitesh Narayan Lal, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     07bbecb3410617816a99e76a2df7576507a0c8ad
Gitweb:        https://git.kernel.org/tip/07bbecb3410617816a99e76a2df7576507a0c8ad
Author:        Alex Belits <abelits@marvell.com>
AuthorDate:    Thu, 25 Jun 2020 18:34:43 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:39:02 +02:00

net: Restrict receive packets queuing to housekeeping CPUs

With the existing implementation of store_rps_map(), packets are queued
in the receive path on the backlog queues of other CPUs irrespective of
whether they are isolated or not. This could add a latency overhead to
any RT workload that is running on the same CPU.

Ensure that store_rps_map() only uses available housekeeping CPUs for
storing the rps_map.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200625223443.2684-4-nitesh@redhat.com
---
 net/core/net-sysfs.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e353b82..677868f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -11,6 +11,7 @@
 #include <linux/if_arp.h>
 #include <linux/slab.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/isolation.h>
 #include <linux/nsproxy.h>
 #include <net/sock.h>
 #include <net/net_namespace.h>
@@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 {
 	struct rps_map *old_map, *map;
 	cpumask_var_t mask;
-	int err, cpu, i;
+	int err, cpu, i, hk_flags;
 	static DEFINE_MUTEX(rps_map_mutex);
 
 	if (!capable(CAP_NET_ADMIN))
@@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 		return err;
 	}
 
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+	cpumask_and(mask, mask, housekeeping_cpumask(hk_flags));
+	if (cpumask_empty(mask)) {
+		free_cpumask_var(mask);
+		return -EINVAL;
+	}
+
 	map = kzalloc(max_t(unsigned int,
 			    RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES),
 		      GFP_KERNEL);


* [tip: sched/core] PCI: Restrict probe functions to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs Nitesh Narayan Lal
@ 2020-07-09  8:45   ` tip-bot2 for Alex Belits
  0 siblings, 0 replies; 52+ messages in thread
From: tip-bot2 for Alex Belits @ 2020-07-09  8:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alex Belits, Nitesh Narayan Lal, Peter Zijlstra (Intel),
	Frederic Weisbecker, Bjorn Helgaas, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     69a18b18699b59654333651d95f8ca09d01048f8
Gitweb:        https://git.kernel.org/tip/69a18b18699b59654333651d95f8ca09d01048f8
Author:        Alex Belits <abelits@marvell.com>
AuthorDate:    Thu, 25 Jun 2020 18:34:42 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

PCI: Restrict probe functions to housekeeping CPUs

pci_call_probe() prevents the nesting of work_on_cpu() for a scenario
where a VF device is probed from work_on_cpu() of the PF.

Change the cpumask used in pci_call_probe() from all online CPUs to
housekeeping CPUs only. This ensures that no additional latency
overhead is caused by the pinning of jobs on isolated CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lkml.kernel.org/r/20200625223443.2684-3-nitesh@redhat.com
---
 drivers/pci/pci-driver.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index da6510a..449466f 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/sched/isolation.h>
 #include <linux/cpu.h>
 #include <linux/pm_runtime.h>
 #include <linux/suspend.h>
@@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
 	int error, node, cpu;
+	int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
@@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev))
 		cpu = nr_cpu_ids;
 	else
-		cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(hk_flags));
 
 	if (cpu < nr_cpu_ids)
 		error = work_on_cpu(cpu, local_pci_probe, &ddi);


* [tip: sched/core] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
  2020-06-29 16:11   ` Nitesh Narayan Lal
@ 2020-07-09  8:45   ` tip-bot2 for Alex Belits
  2021-01-27 11:57   ` [Patch v4 1/3] " Robin Murphy
  2 siblings, 0 replies; 52+ messages in thread
From: tip-bot2 for Alex Belits @ 2020-07-09  8:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Alex Belits, Nitesh Narayan Lal, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1abdfe706a579a702799fce465bceb9fb01d407c
Gitweb:        https://git.kernel.org/tip/1abdfe706a579a702799fce465bceb9fb01d407c
Author:        Alex Belits <abelits@marvell.com>
AuthorDate:    Thu, 25 Jun 2020 18:34:41 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

lib: Restrict cpumask_local_spread to housekeeping CPUs

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time
task, it will still be returned to the caller for pinning of its IRQ
threads. Having these unwanted IRQ threads on an isolated CPU adds a
latency overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits <abelits@marvell.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200625223443.2684-2-nitesh@redhat.com
---
 lib/cpumask.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb2..85da6ab 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu;
+	int cpu, hk_flags;
+	const struct cpumask *mask;
 
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+	mask = housekeeping_cpumask(hk_flags);
 	/* Wrap: we always want a cpu. */
-	i %= num_online_cpus();
+	i %= cpumask_weight(mask);
 
 	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, cpu_online_mask)
+		for_each_cpu(cpu, mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 	} else {
 		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
 			if (i-- == 0)
 				return cpu;
+		}
 
-		for_each_cpu(cpu, cpu_online_mask) {
+		for_each_cpu(cpu, mask) {
 			/* Skip NUMA nodes, done above. */
 			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
 				continue;


* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs Nitesh Narayan Lal
  2020-06-29 16:11   ` Nitesh Narayan Lal
  2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
@ 2021-01-27 11:57   ` Robin Murphy
  2021-01-27 12:19     ` Marcelo Tosatti
  2 siblings, 1 reply; 52+ messages in thread
From: Robin Murphy @ 2021-01-27 11:57 UTC (permalink / raw)
  To: Nitesh Narayan Lal, linux-kernel, linux-api, frederic, mtosatti,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	tglx, davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

Hi,

On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> From: Alex Belits <abelits@marvell.com>
> 
> The current implementation of cpumask_local_spread() does not respect
> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
> task, it will still be returned to the caller for pinning of its IRQ
> threads. Having these unwanted IRQ threads on an isolated CPU adds a
> latency overhead.
> 
> Restrict the CPUs that are returned for spreading IRQs only to the
> available housekeeping CPUs.
> 
> Signed-off-by: Alex Belits <abelits@marvell.com>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>   lib/cpumask.c | 16 +++++++++++-----
>   1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index fb22fb266f93..85da6ab4fbb5 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -6,6 +6,7 @@
>   #include <linux/export.h>
>   #include <linux/memblock.h>
>   #include <linux/numa.h>
> +#include <linux/sched/isolation.h>
>   
>   /**
>    * cpumask_next - get the next cpu in a cpumask
> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>    */
>   unsigned int cpumask_local_spread(unsigned int i, int node)
>   {
> -	int cpu;
> +	int cpu, hk_flags;
> +	const struct cpumask *mask;
>   
> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> +	mask = housekeeping_cpumask(hk_flags);

AFAICS, this generally resolves to something based on cpu_possible_mask 
rather than cpu_online_mask as before, so could now potentially return 
an offline CPU. Was that an intentional change?

I was just looking at the current code since I had the rare presence of 
mind to check if something suitable already existed before I start 
open-coding "any online CPU, but local node preferred" logic for 
handling IRQ affinity in a driver - cpumask_local_spread() appears to be 
almost what I want (if a bit more heavyweight), if only it would 
actually guarantee an online CPU as the kerneldoc claims :(

Robin.
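
[For reference, the fallback Robin refers to, paraphrased from
kernel/sched/isolation.c as it looked at the time: when isolation is
not configured, every possible CPU counts as a housekeeping CPU.]

const struct cpumask *housekeeping_cpumask(enum hk_flags flags)
{
	if (static_branch_unlikely(&housekeeping_overridden))
		if (housekeeping_flags & flags)
			return housekeeping_mask;
	return cpu_possible_mask;
}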

>   	/* Wrap: we always want a cpu. */
> -	i %= num_online_cpus();
> +	i %= cpumask_weight(mask);
>   
>   	if (node == NUMA_NO_NODE) {
> -		for_each_cpu(cpu, cpu_online_mask)
> +		for_each_cpu(cpu, mask) {
>   			if (i-- == 0)
>   				return cpu;
> +		}
>   	} else {
>   		/* NUMA first. */
> -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>   			if (i-- == 0)
>   				return cpu;
> +		}
>   
> -		for_each_cpu(cpu, cpu_online_mask) {
> +		for_each_cpu(cpu, mask) {
>   			/* Skip NUMA nodes, done above. */
>   			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>   				continue;
> 


* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 11:57   ` [Patch v4 1/3] " Robin Murphy
@ 2021-01-27 12:19     ` Marcelo Tosatti
  2021-01-27 12:36       ` Robin Murphy
  2021-01-28 16:02       ` Thomas Gleixner
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2021-01-27 12:19 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	tglx, davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> Hi,
> 
> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > From: Alex Belits <abelits@marvell.com>
> > 
> > The current implementation of cpumask_local_spread() does not respect
> > isolated CPUs, i.e., even if a CPU has been isolated for a real-time
> > task, it will still be returned to the caller for pinning of its IRQ
> > threads. Having these unwanted IRQ threads on an isolated CPU adds a
> > latency overhead.
> > 
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> > 
> > Signed-off-by: Alex Belits <abelits@marvell.com>
> > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > ---
> >   lib/cpumask.c | 16 +++++++++++-----
> >   1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index fb22fb266f93..85da6ab4fbb5 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -6,6 +6,7 @@
> >   #include <linux/export.h>
> >   #include <linux/memblock.h>
> >   #include <linux/numa.h>
> > +#include <linux/sched/isolation.h>
> >   /**
> >    * cpumask_next - get the next cpu in a cpumask
> > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> >    */
> >   unsigned int cpumask_local_spread(unsigned int i, int node)
> >   {
> > -	int cpu;
> > +	int cpu, hk_flags;
> > +	const struct cpumask *mask;
> > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > +	mask = housekeeping_cpumask(hk_flags);
> 
> AFAICS, this generally resolves to something based on cpu_possible_mask
> rather than cpu_online_mask as before, so could now potentially return an
> offline CPU. Was that an intentional change?

Robin,

AFAICS online CPUs should be filtered.

> I was just looking at the current code since I had the rare presence of mind
> to check if something suitable already existed before I start open-coding
> "any online CPU, but local node preferred" logic for handling IRQ affinity
> in a driver - cpumask_local_spread() appears to be almost what I want (if a
> bit more heavyweight), if only it would actually guarantee an online CPU as
> the kerneldoc claims :(
> 
> Robin.
> 
> >   	/* Wrap: we always want a cpu. */
> > -	i %= num_online_cpus();
> > +	i %= cpumask_weight(mask);
> >   	if (node == NUMA_NO_NODE) {
> > -		for_each_cpu(cpu, cpu_online_mask)
> > +		for_each_cpu(cpu, mask) {
> >   			if (i-- == 0)
> >   				return cpu;
> > +		}
> >   	} else {
> >   		/* NUMA first. */
> > -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> >   			if (i-- == 0)
> >   				return cpu;
> > +		}
> > -		for_each_cpu(cpu, cpu_online_mask) {
> > +		for_each_cpu(cpu, mask) {
> >   			/* Skip NUMA nodes, done above. */
> >   			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> >   				continue;
> > 



* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 12:19     ` Marcelo Tosatti
@ 2021-01-27 12:36       ` Robin Murphy
  2021-01-27 13:09         ` Marcelo Tosatti
  2021-01-28 16:02       ` Thomas Gleixner
  1 sibling, 1 reply; 52+ messages in thread
From: Robin Murphy @ 2021-01-27 12:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	tglx, davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On 2021-01-27 12:19, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>> Hi,
>>
>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>> From: Alex Belits <abelits@marvell.com>
>>>
>>> The current implementation of cpumask_local_spread() does not respect
>>> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
>>> task, it will still be returned to the caller for pinning of its IRQ
>>> threads. Having these unwanted IRQ threads on an isolated CPU adds a
>>> latency overhead.
>>>
>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>> available housekeeping CPUs.
>>>
>>> Signed-off-by: Alex Belits <abelits@marvell.com>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> ---
>>>    lib/cpumask.c | 16 +++++++++++-----
>>>    1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>> index fb22fb266f93..85da6ab4fbb5 100644
>>> --- a/lib/cpumask.c
>>> +++ b/lib/cpumask.c
>>> @@ -6,6 +6,7 @@
>>>    #include <linux/export.h>
>>>    #include <linux/memblock.h>
>>>    #include <linux/numa.h>
>>> +#include <linux/sched/isolation.h>
>>>    /**
>>>     * cpumask_next - get the next cpu in a cpumask
>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>>     */
>>>    unsigned int cpumask_local_spread(unsigned int i, int node)
>>>    {
>>> -	int cpu;
>>> +	int cpu, hk_flags;
>>> +	const struct cpumask *mask;
>>> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>> +	mask = housekeeping_cpumask(hk_flags);
>>
>> AFAICS, this generally resolves to something based on cpu_possible_mask
>> rather than cpu_online_mask as before, so could now potentially return an
>> offline CPU. Was that an intentional change?
> 
> Robin,
> 
> AFAICS online CPUs should be filtered.

Apologies if I'm being thick, but can you explain how? In the case of 
isolation being disabled or compiled out, housekeeping_cpumask() is 
literally just "return cpu_possible_mask;". If we then iterate over that 
with for_each_cpu() and just return the i'th possible CPU (e.g. in the 
NUMA_NO_NODE case), what guarantees that CPU is actually online?

Robin.
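
[For illustration only: one untested way to restore the "online CPU"
guarantee, not proposed in this thread, would be to intersect with
cpu_online_mask while iterating. The wrap calculation
(i %= cpumask_weight(mask)) would need the same filtering.]

for_each_cpu_and(cpu, housekeeping_cpumask(hk_flags), cpu_online_mask) {
	if (i-- == 0)
		return cpu;
}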

>> I was just looking at the current code since I had the rare presence of mind
>> to check if something suitable already existed before I start open-coding
>> "any online CPU, but local node preferred" logic for handling IRQ affinity
>> in a driver - cpumask_local_spread() appears to be almost what I want (if a
>> bit more heavyweight), if only it would actually guarantee an online CPU as
>> the kerneldoc claims :(
>>
>> Robin.
>>
>>>    	/* Wrap: we always want a cpu. */
>>> -	i %= num_online_cpus();
>>> +	i %= cpumask_weight(mask);
>>>    	if (node == NUMA_NO_NODE) {
>>> -		for_each_cpu(cpu, cpu_online_mask)
>>> +		for_each_cpu(cpu, mask) {
>>>    			if (i-- == 0)
>>>    				return cpu;
>>> +		}
>>>    	} else {
>>>    		/* NUMA first. */
>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>>> +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>    			if (i-- == 0)
>>>    				return cpu;
>>> +		}
>>> -		for_each_cpu(cpu, cpu_online_mask) {
>>> +		for_each_cpu(cpu, mask) {
>>>    			/* Skip NUMA nodes, done above. */
>>>    			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>>    				continue;
>>>
> 


* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 12:36       ` Robin Murphy
@ 2021-01-27 13:09         ` Marcelo Tosatti
  2021-01-27 13:49           ` Robin Murphy
                             ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2021-01-27 13:09 UTC (permalink / raw)
  To: Robin Murphy, Thomas Gleixner
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	tglx, davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
> On 2021-01-27 12:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> > > Hi,
> > > 
> > > On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > > > From: Alex Belits <abelits@marvell.com>
> > > > 
> > > > The current implementation of cpumask_local_spread() does not respect
> > > > isolated CPUs, i.e., even if a CPU has been isolated for a real-time
> > > > task, it will still be returned to the caller for pinning of its IRQ
> > > > threads. Having these unwanted IRQ threads on an isolated CPU adds a
> > > > latency overhead.
> > > > 
> > > > Restrict the CPUs that are returned for spreading IRQs only to the
> > > > available housekeeping CPUs.
> > > > 
> > > > Signed-off-by: Alex Belits <abelits@marvell.com>
> > > > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > > > ---
> > > >    lib/cpumask.c | 16 +++++++++++-----
> > > >    1 file changed, 11 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > > index fb22fb266f93..85da6ab4fbb5 100644
> > > > --- a/lib/cpumask.c
> > > > +++ b/lib/cpumask.c
> > > > @@ -6,6 +6,7 @@
> > > >    #include <linux/export.h>
> > > >    #include <linux/memblock.h>
> > > >    #include <linux/numa.h>
> > > > +#include <linux/sched/isolation.h>
> > > >    /**
> > > >     * cpumask_next - get the next cpu in a cpumask
> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> > > >     */
> > > >    unsigned int cpumask_local_spread(unsigned int i, int node)
> > > >    {
> > > > -	int cpu;
> > > > +	int cpu, hk_flags;
> > > > +	const struct cpumask *mask;
> > > > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > +	mask = housekeeping_cpumask(hk_flags);
> > > 
> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> > > rather than cpu_online_mask as before, so could now potentially return an
> > > offline CPU. Was that an intentional change?
> > 
> > Robin,
> > 
> > AFAICS online CPUs should be filtered.
> 
> Apologies if I'm being thick, but can you explain how? In the case of
> isolation being disabled or compiled out, housekeeping_cpumask() is
> literally just "return cpu_possible_mask;". If we then iterate over that
> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> 
> Robin.

Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
as well.

cpumask_local_spread() should probably be disabling CPU hotplug.

Thomas?
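
[A minimal, untested sketch of that suggestion follows;
__cpumask_local_spread() is a hypothetical helper holding the current
function body.]

unsigned int cpumask_local_spread(unsigned int i, int node)
{
	unsigned int cpu;

	/* Hold the hotplug read lock so the chosen CPU cannot go
	 * offline while it is being selected. This is only valid if
	 * every caller may sleep.
	 */
	cpus_read_lock();
	cpu = __cpumask_local_spread(i, node);
	cpus_read_unlock();
	return cpu;
}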

> 
> > > I was just looking at the current code since I had the rare presence of mind
> > > to check if something suitable already existed before I start open-coding
> > > "any online CPU, but local node preferred" logic for handling IRQ affinity
> > > in a driver - cpumask_local_spread() appears to be almost what I want (if a
> > > bit more heavyweight), if only it would actually guarantee an online CPU as
> > > the kerneldoc claims :(
> > > 
> > > Robin.
> > > 
> > > >    	/* Wrap: we always want a cpu. */
> > > > -	i %= num_online_cpus();
> > > > +	i %= cpumask_weight(mask);
> > > >    	if (node == NUMA_NO_NODE) {
> > > > -		for_each_cpu(cpu, cpu_online_mask)
> > > > +		for_each_cpu(cpu, mask) {
> > > >    			if (i-- == 0)
> > > >    				return cpu;
> > > > +		}
> > > >    	} else {
> > > >    		/* NUMA first. */
> > > > -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > > > +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > > >    			if (i-- == 0)
> > > >    				return cpu;
> > > > +		}
> > > > -		for_each_cpu(cpu, cpu_online_mask) {
> > > > +		for_each_cpu(cpu, mask) {
> > > >    			/* Skip NUMA nodes, done above. */
> > > >    			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> > > >    				continue;
> > > > 
> > 



* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 13:09         ` Marcelo Tosatti
@ 2021-01-27 13:49           ` Robin Murphy
  2021-01-27 14:16           ` Nitesh Narayan Lal
  2021-01-28 15:56           ` Thomas Gleixner
  2 siblings, 0 replies; 52+ messages in thread
From: Robin Murphy @ 2021-01-27 13:49 UTC (permalink / raw)
  To: Marcelo Tosatti, Thomas Gleixner
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On 2021-01-27 13:09, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> On 2021-01-27 12:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>>>> From: Alex Belits <abelits@marvell.com>
>>>>>
>>>>> The current implementation of cpumask_local_spread() does not respect
>>>>> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
>>>>> task, it will still be returned to the caller for pinning of its IRQ
>>>>> threads. Having these unwanted IRQ threads on an isolated CPU adds a
>>>>> latency overhead.
>>>>>
>>>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>>>> available housekeeping CPUs.
>>>>>
>>>>> Signed-off-by: Alex Belits <abelits@marvell.com>
>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>>> ---
>>>>>     lib/cpumask.c | 16 +++++++++++-----
>>>>>     1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>>>> index fb22fb266f93..85da6ab4fbb5 100644
>>>>> --- a/lib/cpumask.c
>>>>> +++ b/lib/cpumask.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     #include <linux/export.h>
>>>>>     #include <linux/memblock.h>
>>>>>     #include <linux/numa.h>
>>>>> +#include <linux/sched/isolation.h>
>>>>>     /**
>>>>>      * cpumask_next - get the next cpu in a cpumask
>>>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>>>>      */
>>>>>     unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>>     {
>>>>> -	int cpu;
>>>>> +	int cpu, hk_flags;
>>>>> +	const struct cpumask *mask;
>>>>> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> +	mask = housekeeping_cpumask(hk_flags);
>>>>
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>>
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>>
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
> 
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.

True, if someone calls from a racy context then there's not much we can 
do to ensure that any CPU *remains* online after we initially observed 
it to be, but when it's called from somewhere safe like a cpuhp offline 
handler, then picking from cpu_online_mask *did* always do the right 
thing (by my interpretation), whereas picking from 
housekeeping_cpumask() might not.

This is why I decided to ask rather than just send a patch to fix what I 
think might be a bug - I have no objection if this *is* intended 
behaviour, other than suggesting we amend the "...selects an online 
CPU..." comment if that aspect was never meant to be relied upon.

Thanks,
Robin.

> 
> cpumask_local_spread() should probably be disabling CPU hotplug.
> 
> Thomas?
> 
>>
>>>> I was just looking at the current code since I had the rare presence of mind
>>>> to check if something suitable already existed before I start open-coding
>>>> "any online CPU, but local node preferred" logic for handling IRQ affinity
>>>> in a driver - cpumask_local_spread() appears to be almost what I want (if a
>>>> bit more heavyweight), if only it would actually guarantee an online CPU as
>>>> the kerneldoc claims :(
>>>>
>>>> Robin.
>>>>
>>>>>     	/* Wrap: we always want a cpu. */
>>>>> -	i %= num_online_cpus();
>>>>> +	i %= cpumask_weight(mask);
>>>>>     	if (node == NUMA_NO_NODE) {
>>>>> -		for_each_cpu(cpu, cpu_online_mask)
>>>>> +		for_each_cpu(cpu, mask) {
>>>>>     			if (i-- == 0)
>>>>>     				return cpu;
>>>>> +		}
>>>>>     	} else {
>>>>>     		/* NUMA first. */
>>>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
>>>>> +		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>>>     			if (i-- == 0)
>>>>>     				return cpu;
>>>>> +		}
>>>>> -		for_each_cpu(cpu, cpu_online_mask) {
>>>>> +		for_each_cpu(cpu, mask) {
>>>>>     			/* Skip NUMA nodes, done above. */
>>>>>     			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>>>>>     				continue;
>>>>>
>>>
> 


* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 13:09         ` Marcelo Tosatti
  2021-01-27 13:49           ` Robin Murphy
@ 2021-01-27 14:16           ` Nitesh Narayan Lal
  2021-01-28 15:56           ` Thomas Gleixner
  2 siblings, 0 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-01-27 14:16 UTC (permalink / raw)
  To: Marcelo Tosatti, Robin Murphy, Thomas Gleixner, frederic
  Cc: linux-kernel, linux-api, juri.lelli, abelits, bhelgaas,
	linux-pci, rostedt, mingo, peterz, davem, akpm, sfr, stephen,
	rppt, jinyuqi, zhangshaokun


On 1/27/21 8:09 AM, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> On 2021-01-27 12:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>> Hi,
>>>>
>>>> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
>>>>> From: Alex Belits <abelits@marvell.com>
>>>>>
>>>>> The current implementation of cpumask_local_spread() does not respect
>>>>> isolated CPUs, i.e., even if a CPU has been isolated for a real-time
>>>>> task, it will still be returned to the caller for pinning of its IRQ
>>>>> threads. Having these unwanted IRQ threads on an isolated CPU adds a
>>>>> latency overhead.
>>>>>
>>>>> Restrict the CPUs that are returned for spreading IRQs only to the
>>>>> available housekeeping CPUs.
>>>>>
>>>>> Signed-off-by: Alex Belits <abelits@marvell.com>
>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>>>> ---
>>>>>    lib/cpumask.c | 16 +++++++++++-----
>>>>>    1 file changed, 11 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>>>>> index fb22fb266f93..85da6ab4fbb5 100644
>>>>> --- a/lib/cpumask.c
>>>>> +++ b/lib/cpumask.c
>>>>> @@ -6,6 +6,7 @@
>>>>>    #include <linux/export.h>
>>>>>    #include <linux/memblock.h>
>>>>>    #include <linux/numa.h>
>>>>> +#include <linux/sched/isolation.h>
>>>>>    /**
>>>>>     * cpumask_next - get the next cpu in a cpumask
>>>>> @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>>>>>     */
>>>>>    unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>>    {
>>>>> -	int cpu;
>>>>> +	int cpu, hk_flags;
>>>>> +	const struct cpumask *mask;
>>>>> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> +	mask = housekeeping_cpumask(hk_flags);
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>>
>> Robin.
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.

Marcelo, before the commit, cpumask_local_spread() was in fact relying
on cpu_online_mask, as Robin mentioned.
The problem here is with housekeeping_cpumask(), which has always
relied on cpu_possible_mask.

>
> cpumask_local_spread() should probably be disabling CPU hotplug.


Yes, and this should also be done at several other places in drivers
that don't take CPU hotplug into account, e.g., at the time of vector
allocation.


-- 
Thanks
Nitesh



* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
  2021-01-27 13:09         ` Marcelo Tosatti
  2021-01-27 13:49           ` Robin Murphy
  2021-01-27 14:16           ` Nitesh Narayan Lal
@ 2021-01-28 15:56           ` Thomas Gleixner
  2021-01-28 16:33             ` Marcelo Tosatti
       [not found]             ` <02ac9d85-7ddd-96da-1252-4663feea7c9f@marvell.com>
  2 siblings, 2 replies; 52+ messages in thread
From: Thomas Gleixner @ 2021-01-28 15:56 UTC (permalink / raw)
  To: Marcelo Tosatti, Robin Murphy
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
>> > > >    /**
>> > > >     * cpumask_next - get the next cpu in a cpumask
>> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
>> > > >     */
>> > > >    unsigned int cpumask_local_spread(unsigned int i, int node)
>> > > >    {
>> > > > -	int cpu;
>> > > > +	int cpu, hk_flags;
>> > > > +	const struct cpumask *mask;
>> > > > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> > > > +	mask = housekeeping_cpumask(hk_flags);
>> > > 
>> > > AFAICS, this generally resolves to something based on cpu_possible_mask
>> > > rather than cpu_online_mask as before, so could now potentially return an
>> > > offline CPU. Was that an intentional change?
>> > 
>> > Robin,
>> > 
>> > AFAICS online CPUs should be filtered.
>> 
>> Apologies if I'm being thick, but can you explain how? In the case of
>> isolation being disabled or compiled out, housekeeping_cpumask() is
>> literally just "return cpu_possible_mask;". If we then iterate over that
>> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
>> NUMA_NO_NODE case), what guarantees that CPU is actually online?
>> 
>> Robin.
>
> Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> as well.
>
> cpumask_local_spread() should probably be disabling CPU hotplug.

It can't unless all callers are from preemptible code.

Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
over the kernel is just wrong, really.

As I explained several times before there are very valid reasons for
having queues and interrupts on isolated CPUs. Just optimizing for the
usecases some people care about is not making anything better.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-27 12:19     ` Marcelo Tosatti
  2021-01-27 12:36       ` Robin Murphy
@ 2021-01-28 16:02       ` Thomas Gleixner
  2021-01-28 16:59         ` Marcelo Tosatti
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2021-01-28 16:02 UTC (permalink / raw)
  To: Marcelo Tosatti, Robin Murphy
  Cc: Nitesh Narayan Lal, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>> > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> > +	mask = housekeeping_cpumask(hk_flags);
>> 
>> AFAICS, this generally resolves to something based on cpu_possible_mask
>> rather than cpu_online_mask as before, so could now potentially return an
>> offline CPU. Was that an intentional change?
>
> Robin,
>
> AFAICS online CPUs should be filtered.

The whole pile wants to be reverted. It's simply broken in several ways.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-28 15:56           ` Thomas Gleixner
@ 2021-01-28 16:33             ` Marcelo Tosatti
       [not found]             ` <02ac9d85-7ddd-96da-1252-4663feea7c9f@marvell.com>
  1 sibling, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2021-01-28 16:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Robin Murphy, Nitesh Narayan Lal, linux-kernel, linux-api,
	frederic, juri.lelli, abelits, bhelgaas, linux-pci, rostedt,
	mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

On Thu, Jan 28, 2021 at 04:56:07PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
> >> > > >    /**
> >> > > >     * cpumask_next - get the next cpu in a cpumask
> >> > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> >> > > >     */
> >> > > >    unsigned int cpumask_local_spread(unsigned int i, int node)
> >> > > >    {
> >> > > > -	int cpu;
> >> > > > +	int cpu, hk_flags;
> >> > > > +	const struct cpumask *mask;
> >> > > > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > > > +	mask = housekeeping_cpumask(hk_flags);
> >> > > 
> >> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> >> > > rather than cpu_online_mask as before, so could now potentially return an
> >> > > offline CPU. Was that an intentional change?
> >> > 
> >> > Robin,
> >> > 
> >> > AFAICS online CPUs should be filtered.
> >> 
> >> Apologies if I'm being thick, but can you explain how? In the case of
> >> isolation being disabled or compiled out, housekeeping_cpumask() is
> >> literally just "return cpu_possible_mask;". If we then iterate over that
> >> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> >> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> >> 
> >> Robin.
> >
> > Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> > as well.
> >
> > cpumask_local_spread() should probably be disabling CPU hotplug.
> 
> It can't unless all callers are from preemptible code.
> 
> Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> over the kernel is just wrong, really.
> 
> As I explained several times before there are very valid reasons for
> having queues and interrupts on isolated CPUs. Just optimizing for the
> usecases some people care about is not making anything better.

And that is right.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-28 16:02       ` Thomas Gleixner
@ 2021-01-28 16:59         ` Marcelo Tosatti
  2021-01-28 17:35           ` Nitesh Narayan Lal
  2021-01-28 20:01           ` Thomas Gleixner
  0 siblings, 2 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2021-01-28 16:59 UTC (permalink / raw)
  To: Thomas Gleixner, Nitesh Narayan Lal
  Cc: Robin Murphy, Nitesh Narayan Lal, linux-kernel, linux-api,
	frederic, juri.lelli, abelits, bhelgaas, linux-pci, rostedt,
	mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
> >> > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > +	mask = housekeeping_cpumask(hk_flags);
> >> 
> >> AFAICS, this generally resolves to something based on cpu_possible_mask
> >> rather than cpu_online_mask as before, so could now potentially return an
> >> offline CPU. Was that an intentional change?
> >
> > Robin,
> >
> > AFAICS online CPUs should be filtered.
> 
> The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)
Anyway...

So housekeeping_cpumask has multiple meanings. In this case:

HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ

     domain
       Isolate from the general SMP balancing and scheduling
       algorithms. Note that performing domain isolation this way
       is irreversible: it's not possible to bring back a CPU to
       the domains once isolated through isolcpus. It's strongly
       advised to use cpusets instead to disable scheduler load
       balancing through the "cpuset.sched_load_balance" file.
       It offers a much more flexible interface where CPUs can
       move in and out of an isolated set anytime.

       You can move a process onto or off an "isolated" CPU via
       the CPU affinity syscalls or cpuset.
       <cpu number> begins at 0 and the maximum value is
       "number of CPUs in system - 1".

     managed_irq

       Isolate from being targeted by managed interrupts
       which have an interrupt mask containing isolated
       CPUs. The affinity of managed interrupts is
       handled by the kernel and cannot be changed via
       the /proc/irq/* interfaces.

       This isolation is best effort and only effective
       if the automatically assigned interrupt mask of a
       device queue contains isolated and housekeeping
       CPUs. If housekeeping CPUs are online then such
       interrupts are directed to the housekeeping CPU
       so that IO submitted on the housekeeping CPU
       cannot disturb the isolated CPU.

       If a queue's affinity mask contains only isolated
       CPUs then this parameter has no effect on the
       interrupt routing decision, though interrupts are
       only delivered when tasks running on those
       isolated CPUs submit IO. IO submitted on
       housekeeping CPUs has no influence on those
       queues.

So as long as the meaning of the flags is respected, this seems
alright.

Nitesh, is there anything preventing this from being fixed
in userspace ? (as Thomas suggested previously).




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-28 16:59         ` Marcelo Tosatti
@ 2021-01-28 17:35           ` Nitesh Narayan Lal
  2021-01-28 20:01           ` Thomas Gleixner
  1 sibling, 0 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-01-28 17:35 UTC (permalink / raw)
  To: Marcelo Tosatti, Thomas Gleixner
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 1/28/21 11:59 AM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
>> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
>>> On Wed, Jan 27, 2021 at 11:57:16AM +0000, Robin Murphy wrote:
>>>>> +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>>> +	mask = housekeeping_cpumask(hk_flags);
>>>> AFAICS, this generally resolves to something based on cpu_possible_mask
>>>> rather than cpu_online_mask as before, so could now potentially return an
>>>> offline CPU. Was that an intentional change?
>>> Robin,
>>>
>>> AFAICS online CPUs should be filtered.
>> The whole pile wants to be reverted. It's simply broken in several ways.
> I was asking for your comments on interaction with CPU hotplug :-)
> Anyway...
>
> So housekeeping_cpumask has multiple meanings. In this case:
>
> HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ
>
>      domain
>        Isolate from the general SMP balancing and scheduling
>        algorithms. Note that performing domain isolation this way
>        is irreversible: it's not possible to bring back a CPU to
>        the domains once isolated through isolcpus. It's strongly
>        advised to use cpusets instead to disable scheduler load
>        balancing through the "cpuset.sched_load_balance" file.
>        It offers a much more flexible interface where CPUs can
>        move in and out of an isolated set anytime.
>
>        You can move a process onto or off an "isolated" CPU via
>        the CPU affinity syscalls or cpuset.
>        <cpu number> begins at 0 and the maximum value is
>        "number of CPUs in system - 1".
>
>      managed_irq
>
>        Isolate from being targeted by managed interrupts
>        which have an interrupt mask containing isolated
>        CPUs. The affinity of managed interrupts is
>        handled by the kernel and cannot be changed via
>        the /proc/irq/* interfaces.
>
>        This isolation is best effort and only effective
>        if the automatically assigned interrupt mask of a
>        device queue contains isolated and housekeeping
>        CPUs. If housekeeping CPUs are online then such
>        interrupts are directed to the housekeeping CPU
>        so that IO submitted on the housekeeping CPU
>        cannot disturb the isolated CPU.
>
>        If a queue's affinity mask contains only isolated
>        CPUs then this parameter has no effect on the
>        interrupt routing decision, though interrupts are
>        only delivered when tasks running on those
>        isolated CPUs submit IO. IO submitted on
>        housekeeping CPUs has no influence on those
>        queues.
>
> So as long as the meaning of the flags are respected, seems
> alright.
>
> Nitesh, is there anything preventing this from being fixed
> in userspace ? (as Thomas suggested previously).

I think it should be doable, at least for most of the devices.
However, I do wonder if there is a better way of fixing this generically
in the kernel?

Currently, as Thomas mentioned, housekeeping_cpumask() is used at different
locations just to fix the issue corresponding to that component or driver.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-28 16:59         ` Marcelo Tosatti
  2021-01-28 17:35           ` Nitesh Narayan Lal
@ 2021-01-28 20:01           ` Thomas Gleixner
       [not found]             ` <d2a4dc97-a9ed-e0e7-3b9c-c56ae46f6608@redhat.com>
  2021-02-04 18:15             ` Marcelo Tosatti
  1 sibling, 2 replies; 52+ messages in thread
From: Thomas Gleixner @ 2021-01-28 20:01 UTC (permalink / raw)
  To: Marcelo Tosatti, Nitesh Narayan Lal
  Cc: Robin Murphy, Nitesh Narayan Lal, linux-kernel, linux-api,
	frederic, juri.lelli, abelits, bhelgaas, linux-pci, rostedt,
	mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
>> The whole pile wants to be reverted. It's simply broken in several ways.
>
> I was asking for your comments on interaction with CPU hotplug :-)

Which I answered in a separate mail :)

> So housekeeping_cpumask has multiple meanings. In this case:

...

> So as long as the meaning of the flags is respected, this seems
> alright.

Yes. Stuff like the managed interrupts preference for housekeeping CPUs
when an affinity mask spans housekeeping and isolated CPUs is perfectly
fine. It's well thought out and has no limitations.

> Nitesh, is there anything preventing this from being fixed
> in userspace ? (as Thomas suggested previously).

Everything which is not managed can be steered by user space.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
       [not found]               ` <20210129142356.GB40876@fuller.cnet>
@ 2021-01-29 17:34                 ` Alex Belits
       [not found]                 ` <18584612-868c-0f88-5de2-dc93c8638816@redhat.com>
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Belits @ 2021-01-29 17:34 UTC (permalink / raw)
  To: Marcelo Tosatti, Nitesh Narayan Lal
  Cc: Thomas Gleixner, Robin Murphy, linux-kernel, linux-api, frederic,
	juri.lelli, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On 1/29/21 06:23, Marcelo Tosatti wrote:
> External Email
> 
> ----------------------------------------------------------------------
> On Fri, Jan 29, 2021 at 08:55:20AM -0500, Nitesh Narayan Lal wrote:
>>
>> On 1/28/21 3:01 PM, Thomas Gleixner wrote:
>>> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
>>>>> The whole pile wants to be reverted. It's simply broken in several ways.
>>>> I was asking for your comments on interaction with CPU hotplug :-)
>>> Which I answered in a separate mail :)
>>>
>>>> So housekeeping_cpumask has multiple meanings. In this case:
>>> ...
>>>
>>>> So as long as the meaning of the flags is respected, this seems
>>>> alright.
>>> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
>>> when an affinity mask spans housekeeping and isolated CPUs is perfectly
>>> fine. It's well thought out and has no limitations.
>>>
>>>> Nitesh, is there anything preventing this from being fixed
>>>> in userspace ? (as Thomas suggested previously).
>>> Everything which is not managed can be steered by user space.
>>>
>>> Thanks,
>>>
>>>          tglx
>>>
>>
>> So, I think the conclusion here would be to revert the change made in
>> cpumask_local_spread via the patch:
>>   - lib: Restrict cpumask_local_spread to housekeeping CPUs
>>
>> Also, a similar case can be made for the rps patch that went in with
>> this:
>>  - net: Restrict receive packets queuing to housekeeping CPUs
> 
> Yes, this is the userspace solution:
> 
> https://lkml.org/lkml/2021/1/22/815
> 
> Should have a kernel document with this info and examples
> (the network queue configuration as well). Will
> send something.
> 
>>  + net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
>>
>> I am not sure about the PCI patch as I don't think we can control that from
>> the userspace or maybe I am wrong?
> 
> You mean "lib: Restrict cpumask_local_spread to housekeeping CPUs" ?
> 

If we want to do it from userspace, we should have something that 
triggers it in userspace. Should we use udev for this purpose?

-- 
Alex

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
       [not found]             ` <02ac9d85-7ddd-96da-1252-4663feea7c9f@marvell.com>
@ 2021-02-01 17:50               ` Marcelo Tosatti
  0 siblings, 0 replies; 52+ messages in thread
From: Marcelo Tosatti @ 2021-02-01 17:50 UTC (permalink / raw)
  To: Alex Belits
  Cc: Thomas Gleixner, Robin Murphy, Nitesh Narayan Lal, linux-kernel,
	linux-api, frederic, juri.lelli, bhelgaas, linux-pci, rostedt,
	mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

On Fri, Jan 29, 2021 at 07:41:27AM -0800, Alex Belits wrote:
> On 1/28/21 07:56, Thomas Gleixner wrote:
> > External Email
> > 
> > ----------------------------------------------------------------------
> > On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > > On Wed, Jan 27, 2021 at 12:36:30PM +0000, Robin Murphy wrote:
> > > > > > >     /**
> > > > > > >      * cpumask_next - get the next cpu in a cpumask
> > > > > > > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > > > >      */
> > > > > > >     unsigned int cpumask_local_spread(unsigned int i, int node)
> > > > > > >     {
> > > > > > > -	int cpu;
> > > > > > > +	int cpu, hk_flags;
> > > > > > > +	const struct cpumask *mask;
> > > > > > > +	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > > > > +	mask = housekeeping_cpumask(hk_flags);
> > > > > > 
> > > > > > AFAICS, this generally resolves to something based on cpu_possible_mask
> > > > > > rather than cpu_online_mask as before, so could now potentially return an
> > > > > > offline CPU. Was that an intentional change?
> > > > > 
> > > > > Robin,
> > > > > 
> > > > > AFAICS online CPUs should be filtered.
> > > > 
> > > > Apologies if I'm being thick, but can you explain how? In the case of
> > > > isolation being disabled or compiled out, housekeeping_cpumask() is
> > > > literally just "return cpu_possible_mask;". If we then iterate over that
> > > > with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> > > > NUMA_NO_NODE case), what guarantees that CPU is actually online?
> > > > 
> > > > Robin.
> > > 
> > > Nothing, but that was the situation before 1abdfe706a579a702799fce465bceb9fb01d407c
> > > as well.
> > > 
> > > cpumask_local_spread() should probably be disabling CPU hotplug.
> > 
> > It can't unless all callers are from preemptible code.
> > 
> > Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> > over the kernel is just wrong, really.
> > 
> > As I explained several times before there are very valid reasons for
> > having queues and interrupts on isolated CPUs. Just optimizing for the
> > usecases some people care about is not making anything better.
> 
> However making it mandatory for isolated CPUs to allow interrupts is not a
> good idea, either. Providing an environment free of disturbances is a valid
> goal, so we can't do something that will make it impossible to achieve. We
> know both that there is a great amount of demand for this feature and that
> implementing it is doable, so cutting off the possibility of development in
> this direction would be bad.
> 
> Before there was housekeeping mask, I had to implement another, more
> cumbersome model that ended up being more intrusive than I wanted. That was
> one of the reasons why I have spent some time working on it in, please
> forgive me the pun, isolation.
> 
> I was relieved when housekeeping mask appeared, and I was able to remove a
> large chunk of code that distinguished between CPUs that "are there" and
> CPUs "available to run work". Housekeeping is supposed to define the set of
> CPUs that are intended to run work that is not specifically triggered by
> anything running on those CPUs. "CPUs that are there" are CPUs that are
> being maintained as a part of the system, so they are usable for running
> things on them.
> 
> My idea at the time was that we can separate this into two directions of
> development:
> 
> 1. Make sure that housekeeping mask applies to all kinds of work that
> appears on CPUs, so nothing random will end up running there. Because this
> is very much in line with what it does.

It's easier to specify "all members of a set" rather than having to specify each
individual member. Think of the set as a description of the types of
activities that should not have a given CPU as a target.

> 2. Rely on housekeeping mask to exclude anything not specifically intended
> to run on isolated CPUs, and concentrate efforts on making sure that things
> that are intended to [eventually] happen on those CPUs are handled properly
> -- in case of my recent proposals, delayed until synchronization event at
> the next kernel entry.

Your point makes sense, but the "isolated CPU + one interrupt to this CPU"
use-case is a valid one.

"target CPU" is the CPU we are interested in executing the application
(and handling the interrupt). So isolation happens today in the
following components:

        1)  setup kernel boot parameters for isolation (suppose
            the goal is to move these to happen dynamically in
            userspace, so they would move to step 2 in the future).

        2)  disable certain activities from happening at target CPU: "exclude ALL
            activities for target CPU which can be performed by
            housekeeping CPUs" (which might involve a
            number of steps in userspace such as
            the RPS mask configuration, network queue configuration
            to avoid vector exhaustion, removal of other activities
            on target CPU).

        3) userspace reports the status of execution of the isolated
           application (via the prctl interface, the QUIESCE step).
           From this point on interruptions (not as in IRQs but as in
	   "isolation breaking events") should be either ignored
           or logged.

> If we want to have some kind of greater flexibility of CPU hotplug
> management, more runtime control of isolation and housekeeping masks, or

Well, I think any activity which can possibly cause an interruption
should be blocked from happening in the kernel, which can be done as
soon as the

        "if (cpu_is_isolated(cpu))"

logic is integrated.
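
A rough sketch of that gating (hypothetical: cpu_is_isolated() is the
helper named above and does not exist in the kernel at this point; the
fallback to housekeeping_any_cpu() is just one possible policy):

	/* Hypothetical gate: never target an isolated CPU directly. */
	static bool queue_work_on_gated(int cpu, struct workqueue_struct *wq,
					struct work_struct *work)
	{
		if (cpu_is_isolated(cpu))
			cpu = housekeeping_any_cpu(HK_FLAG_DOMAIN);
		return queue_work_on(cpu, wq, work);
	}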

> more specific definition of what housekeeping mask does, we still have to
> implement the intended functionality. Something still has to determine,
> where we should not place delayed work, interrupts, etc. for the purpose of
> distributing this kind of work.

Yes, and today the isolation is being performed:

1) At boot time.
2) At isolation configuration (userspace). We have been trying to
centralize this in tuned profiles:

* With isolcpus=domain,: https://github.com/redhat-performance/tuned/blob/master/profiles/realtime-virtual-host/realtime-virtual-host-variables.conf
* With cgroups sets / systemd affinity masks: https://github.com/redhat-performance/tuned/blob/master/profiles/cpu-partitioning/tuned.conf

> As far as I can tell, the real underlying problem is that if anything
> changes (hotplug happens, CPUs are added or removed from hotplug mask,
> etc.), for an absolutely correct solution in the most general case we would
> have to modify all those masks that we have determined for various purposes
> in various pieces of code and re-apply them. So, say, RPS implemented in a
> running driver will be reconfigured to remove CPUs that are no longer there,
> or CPUs that we now no longer want in "housekeeping" or "available for RPS,
> delayed work, or something similar" mask.

We discussed separating housekeeping_mask into per-feature masks here:

https://lkml.org/lkml/2020/9/3/1067

So that userspace could control each feature individually.

> To do that (and for full and complete support of hotplug we will have to do
> that), we would have to reconfigure all users of CPU masks on hotplug and on
> whatever other events that reassign a CPU's purpose.

Good point.

-> Isolate CPU event: steps 1 and 2 above.
-> Unisolate CPU event: undo steps 1 and 2 above (considering the
boot-time limitation of 1).

-> Network device creation: Currently the proposed solution seems to be 
https://lkml.org/lkml/2021/1/22/815, which requires userspace
code modification (still missing this case).

-> CPU hotplug/hotunplug: unclear. 

> This may be eventually
> done, however I don't know if it's a good idea to implement it at once, the
> task of "adjusting all consumers of feature X" usually takes a while, and
> even proposing such a thing imposes a great amount of work on many people.

Agree. CPU hotplug, for example, is not a feature which is very common
in this type of deployment. Having an initial implementation
which is suitable for the use cases in production seems OK.

> However we can make a decision, if we really want to go into that direction,
> and provide a way of handling a general event "CPUs are being repurposed,
> now adjust whatever is being sent to them". It's my opinion that it would be
> a good idea, we just can't expect it to be adopted overnight, so we can't
> make it mandatory.
> 
> However if we don't want to block development of "full isolation" for a very
> long time over this, we can have some minimal implementation available
> before there will be a decision and implementation of completely dynamic
> handling of "what can be done on what CPU".

Agree: the dynamic handling can be done as a later step.

> So maybe we have to implement something much more simple. For CPU hotplug it
> may be something like, "CPU coming online should not be added to
> housekeeping mask unless full isolation is disabled". 

Not sure I get this.

> Then we can add a
> procedure for adding a non-housekeeping CPU to housekeeping mask. And that
> procedure will fail if CPU is isolated. 
>
> If we will have dynamic
> configuration of isolated CPUs we can make it necessary to remove CPU from
> isolated mask and add it to housekeeping after that. That will never produce
> any problems. The worst that will happen, some users of CPU masks will miss
> an opportunity to add more CPUs to whatever they use.

The usecase here seems to be hardware with a single OS running both
isolated and non-isolated workloads (say a latency-sensitive
application using multiple cores, which can scale up/down in
the number of cores depending on load). On the remaining CPUs
one can run non-latency-sensitive workloads.

Does that make sense?

> Then there will be a problem, what to do when we want to remove CPU from
> housekeeping or maybe this new category "CPUs that are available for various
> things". Initially we can just allow it only on hotplug.

I fail to parse that sentence. Can you expand on "CPUs that are available
for various things"?

> If we will want to implement removing a CPU from housekeeping and adding it
> to become available for isolation in general, we will have to do something
> new. It will probably look like a different kind of hotplug.

Hmm, not sure why; have you read the discussion mentioned above about
the sysfs interface to control housekeeping?

> Maybe it's a bad idea to have only one kind of "housekeeping". 

Yes! That seems to be the same conclusion on the thread.

Forgot to CC you here:
https://www.spinics.net/lists/kernel/msg3814616.html

Can probably drop patch 7/9 from your series once this is
integrated.

> As I said
> above, we may choose to implement "CPUs available for work intended to be
> spread over multiple CPUs". Or maybe there should be "CPUs available for
> long-term running kernel tasks that are not considered housekeeping". And we
> can have separate "CPUs that are allowed to handle timers". There may be
> reasons for supporting this, other than isolation -- for example, big.LITTLE
> or new kinds of power management policies. But then, of course, we will have
> to consider more possible uses for those features, and adjust them for those
> purposes.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-01-28 20:01           ` Thomas Gleixner
       [not found]             ` <d2a4dc97-a9ed-e0e7-3b9c-c56ae46f6608@redhat.com>
@ 2021-02-04 18:15             ` Marcelo Tosatti
  2021-02-04 18:47               ` Nitesh Narayan Lal
  1 sibling, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2021-02-04 18:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nitesh Narayan Lal, Robin Murphy, linux-kernel, linux-api,
	frederic, juri.lelli, abelits, bhelgaas, linux-pci, rostedt,
	mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >> The whole pile wants to be reverted. It's simply broken in several ways.
> >
> > I was asking for your comments on interaction with CPU hotplug :-)
> 
> Which I answered in a separate mail :)
> 
> > So housekeeping_cpumask has multiple meanings. In this case:
> 
> ...
> 
> > So as long as the meaning of the flags is respected, this seems
> > alright.
> 
> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> when an affinity mask spans housekeeping and isolated CPUs is perfectly
> fine. It's well thought out and has no limitations.
> 
> > Nitesh, is there anything preventing this from being fixed
> > in userspace ? (as Thomas suggested previously).
> 
> Everything which is not managed can be steered by user space.

Yes, but it seems to be racy (that is, there is a window where the 
interrupt can be delivered to an isolated CPU).

ethtool ->
xgbe_set_channels ->
xgbe_full_restart_dev ->
xgbe_alloc_memory ->
xgbe_alloc_channels ->
cpumask_local_spread

Also ifconfig eth0 down / ifconfig eth0 up leads
to cpumask_local_spread().

How about adding a new flag for isolcpus instead?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-04 18:15             ` Marcelo Tosatti
@ 2021-02-04 18:47               ` Nitesh Narayan Lal
  2021-02-04 19:06                 ` Marcelo Tosatti
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-02-04 18:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Thomas Gleixner
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
>> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
>>>> The whole pile wants to be reverted. It's simply broken in several ways.
>>> I was asking for your comments on interaction with CPU hotplug :-)
>> Which I answered in a separate mail :)
>>
>>> So housekeeping_cpumask has multiple meanings. In this case:
>> ...
>>
>>> So as long as the meaning of the flags is respected, this seems
>>> alright.
>> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
>> when an affinity mask spans housekeeping and isolated CPUs is perfectly
>> fine. It's well thought out and has no limitations.
>>
>>> Nitesh, is there anything preventing this from being fixed
>>> in userspace ? (as Thomas suggested previously).
>> Everything which is not managed can be steered by user space.
> Yes, but it seems to be racy (that is, there is a window where the 
> interrupt can be delivered to an isolated CPU).
>
> ethtool ->
> xgbe_set_channels ->
> xgbe_full_restart_dev ->
> xgbe_alloc_memory ->
> xgbe_alloc_channels ->
> cpumask_local_spread
>
> Also ifconfig eth0 down / ifconfig eth0 up leads
> to cpumask_local_spread().

There's always that possibility.

We have to ensure that we move the IRQs by a tuned daemon or some other
userspace script every time there is a net-dev change (e.g. the device
comes up, creates VFs, etc.).

> How about adding a new flag for isolcpus instead?
>

Do you mean a flag based on which we can switch the affinity mask to
housekeeping for all the devices at the time of IRQ distribution?

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-04 18:47               ` Nitesh Narayan Lal
@ 2021-02-04 19:06                 ` Marcelo Tosatti
  2021-02-04 19:17                   ` Nitesh Narayan Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Marcelo Tosatti @ 2021-02-04 19:06 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Thomas Gleixner, Robin Murphy, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> > On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> >> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >>>> The whole pile wants to be reverted. It's simply broken in several ways.
> >>> I was asking for your comments on interaction with CPU hotplug :-)
> >> Which I answered in a separate mail :)
> >>
> >>> So housekeeping_cpumask has multiple meanings. In this case:
> >> ...
> >>
> >>> So as long as the meaning of the flags is respected, this seems
> >>> alright.
> >> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> >> when an affinity mask spans housekeeping and isolated CPUs is perfectly
> >> fine. It's well thought out and has no limitations.
> >>
> >>> Nitesh, is there anything preventing this from being fixed
> >>> in userspace ? (as Thomas suggested previously).
> >> Everything which is not managed can be steered by user space.
> > Yes, but it seems to be racy (that is, there is a window where the 
> > interrupt can be delivered to an isolated CPU).
> >
> > ethtool ->
> > xgbe_set_channels ->
> > xgbe_full_restart_dev ->
> > xgbe_alloc_memory ->
> > xgbe_alloc_channels ->
> > cpumask_local_spread
> >
> > Also ifconfig eth0 down / ifconfig eth0 up leads
> > to cpumask_local_spread().
> 
> There's always that possibility.

Then there is a window where isolation can be broken.

> We have to ensure that we move the IRQs by a tuned daemon or some other
> userspace script every time there is a net-dev change (e.g. the device
> comes up, creates VFs, etc.).

Again, while the race window is open, an interrupt can be delivered to an
isolated CPU.

> > How about adding a new flag for isolcpus instead?
> >
> 
> Do you mean a flag based on which we can switch the affinity mask to
> housekeeping for all the devices at the time of IRQ distribution?

Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-04 19:06                 ` Marcelo Tosatti
@ 2021-02-04 19:17                   ` Nitesh Narayan Lal
  2021-02-05 22:23                     ` Thomas Gleixner
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-02-04 19:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Thomas Gleixner, Robin Murphy, linux-kernel, linux-api, frederic,
	juri.lelli, abelits, bhelgaas, linux-pci, rostedt, mingo, peterz,
	davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
> On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:

[...]

>>>>> Nitesh, is there anything preventing this from being fixed
>>>>> in userspace ? (as Thomas suggested previously).
>>>> Everything which is not managed can be steered by user space.
>>> Yes, but it seems to be racy (that is, there is a window where the 
>>> interrupt can be delivered to an isolated CPU).
>>>
>>> ethtool ->
>>> xgbe_set_channels ->
>>> xgbe_full_restart_dev ->
>>> xgbe_alloc_memory ->
>>> xgbe_alloc_channels ->
>>> cpumask_local_spread
>>>
>>> Also ifconfig eth0 down / ifconfig eth0 up leads
>>> to cpumask_local_spread().
>> There's always that possibility.
> Then there is a window where isolation can be broken.
>
>> We have to ensure that we move the IRQs by a tuned daemon or some other
>> userspace script every time there is a net-dev change (eg. device comes up,
>> creates VFs, etc).
> Again, while the race window is open, an interrupt can be delivered to an
> isolated CPU.


Yes, and if I am not mistaken, that has always been a problem with
these userspace solutions.

>
>>> How about adding a new flag for isolcpus instead?
>>>
>> Do you mean a flag based on which we can switch the affinity mask to
>> housekeeping for all the devices at the time of IRQ distribution?
> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>
>

Does sound like a nice idea to explore; let's see what Thomas thinks about it.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
       [not found]                 ` <18584612-868c-0f88-5de2-dc93c8638816@redhat.com>
@ 2021-02-05 19:56                   ` Thomas Gleixner
  0 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2021-02-05 19:56 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Marcelo Tosatti
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Fri, Jan 29 2021 at 09:35, Nitesh Narayan Lal wrote:
> On 1/29/21 9:23 AM, Marcelo Tosatti wrote:
>>> I am not sure about the PCI patch as I don't think we can control that from
>>> the userspace or maybe I am wrong?
>> You mean "lib: Restrict cpumask_local_spread to housekeeping CPUs" ?
>
> No, "PCI: Restrict probe functions to housekeeping CPUs".

That part is fine because it just moves the probing to a work queue on a
housekeeping CPU. But that has nothing to do with the interrupt
spreading library.
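
Roughly, the pattern there is (a sketch of the idea, not the exact code
from the series; the flag choice and the node fallback are assumptions):

	/* Run a probe routine via work_on_cpu() on a housekeeping CPU,
	 * preferring one on the device's NUMA node (node assumed valid). */
	static long probe_on_housekeeping_cpu(int node, long (*fn)(void *),
					      void *arg)
	{
		const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_WQ);
		int cpu = cpumask_any_and(cpumask_of_node(node), hk);

		if (cpu >= nr_cpu_ids)	/* no housekeeping CPU on this node */
			cpu = cpumask_any(hk);
		return work_on_cpu(cpu, fn, arg);
	}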

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-04 19:17                   ` Nitesh Narayan Lal
@ 2021-02-05 22:23                     ` Thomas Gleixner
  2021-02-05 22:26                       ` Thomas Gleixner
                                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Thomas Gleixner @ 2021-02-05 22:23 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Marcelo Tosatti
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>> How about adding a new flag for isolcpus instead?
>>>>
>>> Do you mean a flag based on which we can switch the affinity mask to
>>> housekeeping for all the devices at the time of IRQ distribution?
>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>
> Does sound like a nice idea to explore; let's see what Thomas thinks about it.

I just read back up on that whole discussion and stared into the usage
sites a bit.

There are a couple of issues here in a larger picture. Looking at it
from the device side first:

The spreading is done for non-managed queues/interrupts which makes them
movable by user space. So it could be argued from both sides that the
damage done by allowing the full online mask or by allowing only the
house keeping mask can be fixed up by user space.

But that's the trivial part of the problem. The real problem is CPU
hotplug and offline CPUs and the way interrupts are set up for their
initial affinity.

As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
cpumask_local_spread to houskeeping CPUs") is broken as it can return
offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

The original code is racy vs. hotplug unless the callers block hotplug.

Let's look at all the callers and what they do with it.

  cptvf_set_irq_affinity()     affinity hint
  safexcel_request_ring_irq()  affinity hint
  mv_cesa_probe()              affinity hint
  bnxt_request_irq()           affinity hint
  nicvf_set_irq_affinity()     affinity hint
  cxgb4_set_msix_aff()         affinity hint
  enic_init_affinity_hint()    affinity hint
  iavf_request_traffic_irqs()  affinity hint
  ionic_alloc_qcq_interrupt()  affinity hint
  efx_set_interrupt_affinity() affinity hint
  i40e_vsi_request_irq_msix()  affinity hint

  be_evt_queues_create()       affinity hint, queue affinity
  hns3_nic_set_cpumask()       affinity hint, queue affinity
  mlx4_en_init_affinity_hint() affinity hint, queue affinity
  mlx4_en_create_tx_ring()     affinity hint, queue affinity
  set_comp_irq_affinity_hint() affinity hint, queue affinity
  i40e_config_xps_tx_ring()    affinity hint, queue affinity
  
  hclge_configure              affinity_hint, queue affinity, workqueue selection

  ixgbe_alloc_q_vector()       node selection, affinity hint, queue affinity

All of them do not care about disabling hotplug. Taking cpus_read_lock()
inside of that spread function would not solve anything because once the
lock is dropped the CPU can go away.

There are 3 classes of this:

   1) Does not matter: affinity hint

   2) Might fail to set up the network queue when the selected CPU
      is offline.

   3) Broken: The hclge driver which uses the cpu to schedule work on
      that cpu. That's broken, but unfortunately neither the workqueue
      code nor the timer code will ever notice. The work just won't be
      scheduled until the CPU comes online again which might be never.
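
A minimal sketch of that third pattern (illustrative names, this is not
the actual hclge code):

	struct foo_priv {
		struct delayed_work service_task;
	};

	static void foo_start_service_task(struct foo_priv *priv, int i, int node)
	{
		/* May return an offline CPU, see above. */
		int cpu = cpumask_local_spread(i, node);

		/*
		 * If 'cpu' is offline, the work sits on that CPU's pool
		 * and only ever runs if the CPU comes back online.
		 * Nothing warns about it.
		 */
		schedule_delayed_work_on(cpu, &priv->service_task, HZ);
	}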

But looking at the above I really have to ask the question what the
commit in question is actually trying to solve.

AFAICT, nothing at all. Why?

  1) The majority of the drivers set the hint _after_ requesting the
     interrupt

  2) Even if set _before_ requesting the interrupt it does not solve
     anything because it's a hint and the interrupt core code does
     not care about it at all. It provides the storage and the procfs
     interface nothing else.

So how does that prevent the interrupt subsystem from assigning an
interrupt to an isolated CPU? Not at all.

Interrupts which are freshly allocated get the default interrupt
affinity mask, which is either set on the command line or via /proc. The
affinity of the interrupt can be changed after it has been populated in
/proc.

When the interrupt is requested, one of the online CPUs in its
affinity mask is chosen.
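
Roughly (a simplified sketch of the request-time selection in
kernel/irq/manage.c; details differ per architecture and kernel version):

	/* Sketch: restrict the stored affinity to online CPUs, fall back
	 * to the default mask when the intersection is empty. */
	static int irq_setup_affinity_sketch(struct irq_desc *desc)
	{
		cpumask_var_t mask;
		int ret;

		if (!alloc_cpumask_var(&mask, GFP_KERNEL))
			return -ENOMEM;
		cpumask_and(mask, desc->irq_common_data.affinity,
			    cpu_online_mask);
		if (cpumask_empty(mask))
			cpumask_copy(mask, irq_default_affinity);
		ret = irq_do_set_affinity(&desc->irq_data, mask, false);
		free_cpumask_var(mask);
		return ret;
	}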

X86 is special here because this also requires that there are free
vectors on one of the online CPUs in the mask. If the CPUs in the
affinity mask run out of vectors then it will grab a vector from some
other CPU which might be an isolated CPU.

When the affinity mask of the interrupt at the time when it is actually
requested contains an isolated CPU then nothing prevents the kernel from
steering it at an isolated CPU. But that has absolutely nothing to do
with that spreading thingy.

The only difference which this change makes is the fact that the
affinity hint changes. Nothing else.

This whole blurb about how it might break isolation when an interrupt is
requested is just nonsensical, really.

If the default affinity mask is not correctly set up before devices are
initialized then it's not going to be cured by changing that spread
function. If the user space irq balancer ignores the isolation mask and
blindly moves stuff to the affinity hint, then this monstrosity needs to
be fixed.

So I'm going to revert this commit because it _IS_ broken _AND_ useless
and does not solve anything it claims to solve.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-05 22:23                     ` Thomas Gleixner
@ 2021-02-05 22:26                       ` Thomas Gleixner
  2021-02-05 23:02                       ` [tip: sched/urgent] Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs" tip-bot2 for Thomas Gleixner
  2021-02-07  0:43                       ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs Nitesh Narayan Lal
  2 siblings, 0 replies; 52+ messages in thread
From: Thomas Gleixner @ 2021-02-05 22:26 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Marcelo Tosatti
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun

On Fri, Feb 05 2021 at 23:23, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>
>> Does sound like a nice idea to explore; let's see what Thomas thinks about it.

<.SNIP.>

> So I'm going to revert this commit because it _IS_ broken _AND_ useless
> and does not solve anything it claims to solve.

And no. HK_FLAG_IRQ_SPREAD is not going to solve anything either.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [tip: sched/urgent] Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs"
  2021-02-05 22:23                     ` Thomas Gleixner
  2021-02-05 22:26                       ` Thomas Gleixner
@ 2021-02-05 23:02                       ` tip-bot2 for Thomas Gleixner
  2021-02-07  0:43                       ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs Nitesh Narayan Lal
  2 siblings, 0 replies; 52+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2021-02-05 23:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Robin Murphy, Thomas Gleixner, Nitesh Narayan Lal,
	Marcelo Tosatti, abelits, davem, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     2452483d9546de1c540f330469dc4042ff089731
Gitweb:        https://git.kernel.org/tip/2452483d9546de1c540f330469dc4042ff089731
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 05 Feb 2021 23:28:29 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 05 Feb 2021 23:28:29 +01:00

Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs"

This reverts commit 1abdfe706a579a702799fce465bceb9fb01d407c.

This change is broken and not solving any problem it claims to solve.

Robin reported that cpumask_local_spread() now returns any cpu out of
cpu_possible_mask in case NOHZ_FULL is disabled (runtime or compile
time). It can also return any offline or not-present CPU in the
housekeeping mask. Before that it was returning a CPU out of
cpu_online_mask.

While the function is racy against CPU hotplug if the caller does not
protect against it, the actual use cases do not care much about it, as
they use it mostly as a hint for:

 - the user space affinity hint which is unused by the kernel
 - memory node selection which is just suboptimal
 - network queue affinity which might fail but is handled gracefully

But the occasional fail vs. hotplug is very different from returning
anything from possible_cpu_mask which can have a large amount of offline
CPUs obviously.

The changelog of the commit claims:

 "The current implementation of cpumask_local_spread() does not respect
  the isolated CPUs, i.e., even if a CPU has been isolated for Real-Time
  task, it will return it to the caller for pinning of its IRQ
  threads. Having these unwanted IRQ threads on an isolated CPU adds up
  to a latency overhead."

The only correct part of this changelog is:

 "The current implementation of cpumask_local_spread() does not respect
  the isolated CPUs."

Everything else is just disjunct from reality.

Reported-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: abelits@marvell.com
Cc: davem@davemloft.net
Link: https://lore.kernel.org/r/87y2g26tnt.fsf@nanos.tec.linutronix.de
---
 lib/cpumask.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 3592402..c3c76b8 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,7 +6,6 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
-#include <linux/sched/isolation.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -206,27 +205,22 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
  */
 unsigned int cpumask_local_spread(unsigned int i, int node)
 {
-	int cpu, hk_flags;
-	const struct cpumask *mask;
+	int cpu;
 
-	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
-	mask = housekeeping_cpumask(hk_flags);
 	/* Wrap: we always want a cpu. */
-	i %= cpumask_weight(mask);
+	i %= num_online_cpus();
 
 	if (node == NUMA_NO_NODE) {
-		for_each_cpu(cpu, mask) {
+		for_each_cpu(cpu, cpu_online_mask)
 			if (i-- == 0)
 				return cpu;
-		}
 	} else {
 		/* NUMA first. */
-		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
+		for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
 			if (i-- == 0)
 				return cpu;
-		}
 
-		for_each_cpu(cpu, mask) {
+		for_each_cpu(cpu, cpu_online_mask) {
 			/* Skip NUMA nodes, done above. */
 			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
 				continue;

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-05 22:23                     ` Thomas Gleixner
  2021-02-05 22:26                       ` Thomas Gleixner
  2021-02-05 23:02                       ` [tip: sched/urgent] Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs" tip-bot2 for Thomas Gleixner
@ 2021-02-07  0:43                       ` Nitesh Narayan Lal
  2021-02-11 15:55                         ` Nitesh Narayan Lal
  2 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-02-07  0:43 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 2/5/21 5:23 PM, Thomas Gleixner wrote:
> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>> How about adding a new flag for isolcpus instead?
>>>>>
>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>> housekeeping for all the devices at the time of IRQ distribution?
>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>> Does sound like a nice idea to explore; let's see what Thomas thinks about it.
> I just read back up on that whole discussion and stared into the usage
> sites a bit.
>
> There are a couple of issues here in a larger picture. Looking at it
> from the device side first:
>
> The spreading is done for non-managed queues/interrupts which makes them
> movable by user space. So it could be argued from both sides that the
> damage done by allowing the full online mask or by allowing only the
> house keeping mask can be fixed up by user space.
>
> But that's the trivial part of the problem. The real problem is CPU
> hotplug and offline CPUs and the way interrupts are set up for their
> initial affinity.
>
> As Robin noticed, the change in 1abdfe706a57 ("lib: Restrict
> cpumask_local_spread to houskeeping CPUs") is broken as it can return
> offline CPUs in both the NOHZ_FULL and the !NOHZ_FULL case.

A quick question here: is there any reason why we don't have cpu_online_mask
instead of cpu_possible_mask in housekeeping_cpumask()?
(Not for this particular patch, but in general.)

>
> The original code is racy vs. hotplug unless the callers block hotplug.
>
> Let's look at all the callers and what they do with it.
>
>   cptvf_set_irq_affinity()     affinity hint
>   safexcel_request_ring_irq()  affinity hint
>   mv_cesa_probe()              affinity hint
>   bnxt_request_irq()           affinity hint
>   nicvf_set_irq_affinity()     affinity hint
>   cxgb4_set_msix_aff()         affinity hint
>   enic_init_affinity_hint()    affinity hint
>   iavf_request_traffic_irqs()  affinity hint
>   ionic_alloc_qcq_interrupt()  affinity hint
>   efx_set_interrupt_affinity() affinity hint
>   i40e_vsi_request_irq_msix()  affinity hint
>
>   be_evt_queues_create()       affinity hint, queue affinity
>   hns3_nic_set_cpumask()       affinity hint, queue affinity
>   mlx4_en_init_affinity_hint() affinity hint, queue affinity
>   mlx4_en_create_tx_ring()     affinity hint, queue affinity
>   set_comp_irq_affinity_hint() affinity hint, queue affinity
>   i40e_config_xps_tx_ring()    affinity hint, queue affinity
>   
>   hclge_configure              affinity_hint, queue affinity, workqueue selection
>
>   ixgbe_alloc_q_vector()       node selection, affinity hint, queue affinity
>
> All of them do not care about disabling hotplug. Taking cpus_read_lock()
> inside of that spread function would not solve anything because once the
> lock is dropped the CPU can go away.
>
> There are 3 classes of this:
>
>    1) Does not matter: affinity hint
>
>    2) Might fail to set up the network queue when the selected CPU
>       is offline.
>
>    3) Broken: The hclge driver which uses the cpu to schedule work on
>       that cpu. That's broken, but unfortunately neither the workqueue
>       code nor the timer code will ever notice. The work just won't be
>       scheduled until the CPU comes online again which might be never.

Agreed.

> But looking at the above I really have to ask the question what the
> commit in question is actually trying to solve.
>
> AFAICT, nothing at all. Why?
>
>   1) The majority of the drivers set the hint _after_ requesting the
>      interrupt
>
>   2) Even if set _before_ requesting the interrupt it does not solve
>      anything because it's a hint and the interrupt core code does
>      not care about it at all. It provides the storage and the procfs
>      interface nothing else.
>
> So how does that prevent the interrupt subsystem from assigning an
> interrupt to an isolated CPU? Not at all.
>
> Interrupts which are freshly allocated get the default interrupt
> affinity mask, which is either set on the command line or via /proc. The
> affinity of the interrupt can be changed after it has been populated in
> /proc.
>
> When the interrupt is requested, one of the online CPUs in its
> affinity mask is chosen.
>
> X86 is special here because this also requires that there are free
> vectors on one of the online CPUs in the mask. If the CPUs in the
> affinity mask run out of vectors then it will grab a vector from some
> other CPU which might be an isolated CPU.
>
> When the affinity mask of the interrupt at the time when it is actually
> requested contains an isolated CPU then nothing prevents the kernel from
> steering it at an isolated CPU. But that has absolutely nothing to do
> with that spreading thingy.
>
> The only difference which this change makes is the fact that the
> affinity hint changes. Nothing else.
>

Thanks for the detailed explanation.

Before I posted this patch, I was doing some debugging on a setup where I
was observing some latency issues due to the iavf IRQs that were pinned on
the isolated CPUs.

Based on some initial traces I had the impression that the affinity hint
or cpumask_local_spread() was somehow playing a role in deciding the affinity
mask of these IRQs, although that does look incorrect after going through
your explanation.
For some reason, with a kernel that had this patch when I tried creating
VFs iavf IRQs always ended up on the HK CPUs.

The reasoning for the above is still not very clear to me. I will investigate
this further to properly understand this behavior.

-- 
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-07  0:43                       ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs Nitesh Narayan Lal
@ 2021-02-11 15:55                         ` Nitesh Narayan Lal
  2021-03-04 18:15                           ` Nitesh Narayan Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-02-11 15:55 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>
>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>> Does sound like a nice idea to explore; let's see what Thomas thinks about it.

<snip>

>>> When the affinity mask of the interrupt at the time when it is actually
>>> requested contains an isolated CPU then nothing prevents the kernel from
>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>> with that spreading thingy.
>>>
>>> The only difference which this change makes is the fact that the
>>> affinity hint changes. Nothing else.
>>>
> Thanks for the detailed explanation.
>
> Before I posted this patch, I was doing some debugging on a setup where I
> was observing some latency issues due to the iavf IRQs that were pinned on
> the isolated CPUs.
>
> Based on some initial traces, I had the impression that the affinity hint
> or cpumask_local_spread() was somehow playing a role in deciding the affinity
> mask of these IRQs, although that does look incorrect after going through
> your explanation.
> For some reason, with a kernel that had this patch, the iavf IRQs always
> ended up on the HK CPUs when I tried creating VFs.
>
> The reasoning for the above is still not very clear to me. I will investigate
> this further to properly understand this behavior.
>
>

After a little more digging, I found out why the cpumask_local_spread change
affects the general/initial smp_affinity for certain device IRQs.

After the introduction of the commit:

    e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()

For all the drivers that set the hint, the initial affinity is set based on
the CPU retrieved from cpumask_local_spread(). So in an environment where
irqbalance is disabled, these device IRQs remain on the CPUs that are
picked from cpumask_local_spread() even though they are isolated. I think
the commit message of the reverted patch should have covered this as
well.
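
For reference, the relevant hunk looks roughly like the below (quoting from
memory rather than the exact tree, so treat it as a sketch of the idea); the
last three lines are what turn a mere hint into the programmed affinity:

int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
{
        unsigned long flags;
        struct irq_desc *desc = irq_get_desc_lock(irq, &flags,
                                                  IRQ_GET_DESC_CHECK_GLOBAL);

        if (!desc)
                return -EINVAL;
        desc->affinity_hint = m;
        irq_put_desc_unlock(desc, flags);
        /* Set the initial affinity so the IRQ does not stay on CPU0 */
        if (m)
                __irq_set_affinity(irq, m, false);
        return 0;
}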

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-02-11 15:55                         ` Nitesh Narayan Lal
@ 2021-03-04 18:15                           ` Nitesh Narayan Lal
       [not found]                             ` <faa8d84e-db67-7fbe-891e-f4987f106b20@marvell.com>
  2021-04-06 17:22                             ` Jesse Brandeburg
  0 siblings, 2 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-03-04 18:15 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, Jesse Brandeburg
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	abelits, bhelgaas, linux-pci, rostedt, mingo, peterz, davem,
	akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun


On 2/11/21 10:55 AM, Nitesh Narayan Lal wrote:
> On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
>> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>>
>>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>>> Does sound like a nice idea to explore; let's see what Thomas thinks about it.
> <snip>
>
>>>> When the affinity mask of the interrupt at the time when it is actually
>>>> requested contains an isolated CPU then nothing prevents the kernel from
>>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>>> with that spreading thingy.
>>>>
>>>> The only difference which this change makes is the fact that the
>>>> affinity hint changes. Nothing else.
>>>>
>> Thanks for the detailed explanation.
>>
>> Before I posted this patch, I was doing some debugging on a setup where I
>> was observing some latency issues due to the iavf IRQs that were pinned on
>> the isolated CPUs.
>>
>> Based on some initial traces, I had the impression that the affinity hint
>> or cpumask_local_spread() was somehow playing a role in deciding the affinity
>> mask of these IRQs, although that does look incorrect after going through
>> your explanation.
>> For some reason, with a kernel that had this patch, the iavf IRQs always
>> ended up on the HK CPUs when I tried creating VFs.
>>
>> The reasoning for the above is still not very clear to me. I will investigate
>> this further to properly understand this behavior.
>>
>>
> After a little more digging, I found out why the cpumask_local_spread change
> affects the general/initial smp_affinity for certain device IRQs.
>
> After the introduction of the commit:
>
>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>

Continuing the conversation about the above commit and adding Jesse.
I was trying to understand the problem that the commit message describes:
"The default behavior of the kernel is somewhat undesirable as all
requested interrupts end up on CPU0 after registration." I have also been
trying to reproduce this behavior without the patch, but I failed to do
so; maybe I am missing something here.

@Jesse Can you please explain? FWIU IRQ affinity should be decided based on
the default affinity mask.

The problem with the commit is that when we overwrite the affinity mask
based on the hinting mask, we completely ignore the default SMP affinity
mask. If we do want to overwrite the affinity based on the hint mask, we
should at least consider the default SMP affinity.
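
Something along the following lines is what I have in mind; this is a
hypothetical, untested sketch (irq_apply_affinity_hint() is a made-up name)
that only illustrates intersecting the hint with irq_default_affinity:

/*
 * Sketch: before using the hint to program the affinity, reduce it to
 * the CPUs that the default IRQ affinity mask also allows.
 */
static void irq_apply_affinity_hint(unsigned int irq, const struct cpumask *m)
{
        cpumask_var_t mask;

        if (!m || !alloc_cpumask_var(&mask, GFP_KERNEL))
                return;
        /* Non-empty intersection: steer the IRQ only at allowed CPUs */
        if (cpumask_and(mask, m, irq_default_affinity))
                irq_set_affinity(irq, mask);
        free_cpumask_var(mask);
}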

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
       [not found]                             ` <faa8d84e-db67-7fbe-891e-f4987f106b20@marvell.com>
@ 2021-03-04 23:23                               ` Nitesh Narayan Lal
  0 siblings, 0 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-03-04 23:23 UTC (permalink / raw)
  To: Alex Belits, Thomas Gleixner, Marcelo Tosatti, Jesse Brandeburg
  Cc: Robin Murphy, linux-kernel, linux-api, frederic, juri.lelli,
	bhelgaas, linux-pci, rostedt, mingo, peterz, davem, akpm, sfr,
	stephen, rppt, jinyuqi, zhangshaokun


On 3/4/21 4:13 PM, Alex Belits wrote:
> On 3/4/21 10:15, Nitesh Narayan Lal wrote:
>> On 2/11/21 10:55 AM, Nitesh Narayan Lal wrote:
>>> On 2/6/21 7:43 PM, Nitesh Narayan Lal wrote:
>>>> On 2/5/21 5:23 PM, Thomas Gleixner wrote:
>>>>> On Thu, Feb 04 2021 at 14:17, Nitesh Narayan Lal wrote:
>>>>>> On 2/4/21 2:06 PM, Marcelo Tosatti wrote:
>>>>>>>>> How about adding a new flag for isolcpus instead?
>>>>>>>>>
>>>>>>>> Do you mean a flag based on which we can switch the affinity mask to
>>>>>>>> housekeeping for all the devices at the time of IRQ distribution?
>>>>>>> Yes a new flag for isolcpus. HK_FLAG_IRQ_SPREAD or some better name.
>>>>>> Does sound like a nice idea to explore; let's see what Thomas thinks
>>>>>> about it.
>>> <snip>
>>>
>>>>>> When the affinity mask of the interrupt at the time when it is actually
>>>>>> requested contains an isolated CPU then nothing prevents the kernel from
>>>>>> steering it at an isolated CPU. But that has absolutely nothing to do
>>>>>> with that spreading thingy.
>>>>>>
>>>>>> The only difference which this change makes is the fact that the
>>>>>> affinity hint changes. Nothing else.
>>>>>>
>>>> Thanks for the detailed explanation.
>>>>
>>>> Before I posted this patch, I was doing some debugging on a setup where I
>>>> was observing some latency issues due to the iavf IRQs that were pinned on
>>>> the isolated CPUs.
>>>>
>>>> Based on some initial traces, I had the impression that the affinity hint
>>>> or cpumask_local_spread() was somehow playing a role in deciding the affinity
>>>> mask of these IRQs, although that does look incorrect after going through
>>>> your explanation.
>>>> For some reason, with a kernel that had this patch, the iavf IRQs always
>>>> ended up on the HK CPUs when I tried creating VFs.
>>>>
>>>> The reasoning for the above is still not very clear to me. I will
>>>> investigate this further to properly understand this behavior.
>>>>
>>>>
>>> After a little more digging, I found out why the cpumask_local_spread change
>>> affects the general/initial smp_affinity for certain device IRQs.
>>>
>>> After the introduction of the commit:
>>>
>>>      e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>
>>
>> Continuing the conversation about the above commit and adding Jesse.
>> I was trying to understand the problem that the commit message describes:
>> "The default behavior of the kernel is somewhat undesirable as all
>> requested interrupts end up on CPU0 after registration." I have also been
>> trying to reproduce this behavior without the patch, but I failed to do
>> so; maybe I am missing something here.
>>
>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>> the default affinity mask.
>>
>> The problem with the commit is that when we overwrite the affinity mask
>> based on the hinting mask, we completely ignore the default SMP affinity
>> mask. If we do want to overwrite the affinity based on the hint mask, we
>> should at least consider the default SMP affinity.
>>
>
> cpumask_local_spread() is used by a small number of drivers, mostly for
> Ethernet and cryptographic devices; however, it includes the Cavium and
> Marvell devices that were included in every piece of hardware that I and
> Yury Norov worked on. Without my patch (or Yury's previous patch, later
> replaced, that was developed before there were housekeeping CPUs), the
> driver would completely break any attempt to configure task isolation,
> because it would distribute processing over CPUs regardless of any masks
> related to isolation (and later housekeeping). This is why it was created,
> and it just happens that it also makes sense for CPU isolation in general.
> Of course, none of it would be experienced on hardware that does not
> include those devices, possibly creating some wrong impression about its
> effect and purpose.
>
> It may be that my patch can be criticized for not accommodating CPU hotplug
> and other runtime changes of masks. Or drivers can be criticized for their
> behavior that relies on calling cpumask_local_spread() once on
> initialization and then assuming that all CPUs are configured forever.
> However, as far as I can tell, we currently have no other means of
> controlling the behavior of drivers that manage their own interrupt or
> thread to CPU mapping, and no way to communicate any of those changes to
> them while they are running. Drivers may have legitimate reasons for
> maintaining a permanent or semi-permanent CPU core to interrupt mapping,
> especially on devices with very large numbers of CPU cores and built-in
> support for parallel processing of network packets.
>
> If we want it to be done in some manner that accommodates current demands,
> we should simply replace cpumask_local_spread() with something else, or,
> maybe, add some means that will allow dynamic changes. Thankfully, there are
> very few (IIRC, 19) places where cpumask_local_spread() is used, so it may
> be accommodated with a relatively small amount of code to write and test. Then
> everything else will be able to switch to the same mechanism whenever
> necessary.
>

So there are two different issues. The first issue is how the mask
retrieved from cpumask_local_spread() is used to set IRQ affinity.
Ideally, when a device is initialized, its IRQs are distributed based on the
default SMP affinity mask (assuming irqbalance is disabled). However, that
is not the case right now, as some drivers that set their hint affinity
using cpumask_local_spread() overwrite the previously set affinity mask for
the IRQs. So even if you configure default_smp_affinity from
userspace, it will not affect these device IRQs. This is precisely why your
fix for cpumask_local_spread() helped in improving the isolation.

The second issue that you brought up is how to balance the IRQ-specific load
between CPUs efficiently. FWIU, if you have irqbalance enabled, it should
already be doing that based on the policy that you define in userspace.
Is that not the case, or am I missing something?

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-03-04 18:15                           ` Nitesh Narayan Lal
       [not found]                             ` <faa8d84e-db67-7fbe-891e-f4987f106b20@marvell.com>
@ 2021-04-06 17:22                             ` Jesse Brandeburg
  2021-04-07 15:18                               ` Nitesh Narayan Lal
  1 sibling, 1 reply; 52+ messages in thread
From: Jesse Brandeburg @ 2021-04-06 17:22 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Thomas Gleixner, Marcelo Tosatti, Robin Murphy, linux-kernel,
	linux-api, frederic, juri.lelli, abelits, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun

Continuing a thread from a bit ago...

Nitesh Narayan Lal wrote:

> > After a little more digging, I found out why the cpumask_local_spread change
> > affects the general/initial smp_affinity for certain device IRQs.
> >
> > After the introduction of the commit:
> >
> >     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
> >
> 
> Continuing the conversation about the above commit and adding Jesse.
> I was trying to understand the problem that the commit message describes:
> "The default behavior of the kernel is somewhat undesirable as all
> requested interrupts end up on CPU0 after registration." I have also been
> trying to reproduce this behavior without the patch, but I failed to do
> so; maybe I am missing something here.
> 
> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
> the default affinity mask.

The original issue as seen was that if you rmmod/insmod a driver
*without* irqbalance running, the default irq mask is -1, which means
any CPU. The older kernels (this issue was patched in 2014) used to use
that affinity mask, but the value programmed into all the interrupt
registers ("actual affinity") would end up delivering all interrupts to
CPU0, and if the machine was under incoming traffic load when the
driver loaded, CPU0 would start to poll among all the different netdev
queues, all on CPU0.

The above then leads to the condition that the device is stuck polling
even if the affinity gets updated from user space, and the polling will
continue until traffic stops.

> >> The problem with the commit is that when we overwrite the affinity mask
> >> based on the hinting mask, we completely ignore the default SMP affinity
> >> mask. If we do want to overwrite the affinity based on the hint mask, we
> >> should at least consider the default SMP affinity.

Maybe the right thing is to fix which CPUs are passed in as the valid
mask, or make sure the kernel cross checks that what the driver asks
for is a "valid CPU"?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-06 17:22                             ` Jesse Brandeburg
@ 2021-04-07 15:18                               ` Nitesh Narayan Lal
  2021-04-08 18:49                                 ` Nitesh Narayan Lal
  2021-04-14 16:11                                 ` Jesse Brandeburg
  0 siblings, 2 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-04-07 15:18 UTC (permalink / raw)
  To: Jesse Brandeburg, Marcelo Tosatti, Thomas Gleixner, frederic
  Cc: Robin Murphy, linux-kernel, linux-api, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, davem, akpm, sfr,
	stephen, rppt, jinyuqi, zhangshaokun


On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
> Continuing a thread from a bit ago...
>
> Nitesh Narayan Lal wrote:
>
>>> After a little more digging, I found out why cpumask_local_spread change
>>> affects the general/initial smp_affinity for certain device IRQs.
>>>
>>> After the introduction of the commit:
>>>
>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>
>> Continuing the conversation about the above commit and adding Jesse.
>> I was trying to understand the problem that the commit message explains
>> "The default behavior of the kernel is somewhat undesirable as all
>> requested interrupts end up on CPU0 after registration.", I have also been
>> trying to reproduce this behavior without the patch but I failed in doing
>> so, maybe because I am missing something here.
>>
>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>> the default affinity mask.

Thanks, Jesse, for responding.

> The original issue as seen was that if you rmmod/insmod a driver
> *without* irqbalance running, the default irq mask is -1, which means
> any CPU. The older kernels (this issue was patched in 2014) used to use
> that affinity mask, but the value programmed into all the interrupt
> registers "actual affinity" would end up delivering all interrupts to
> CPU0,

So does that mean the affinity mask for the IRQs was different wrt where
the IRQs were actually delivered?
Or was the affinity mask itself for the IRQs changed to 0 instead of -1
after rmmod/insmod?

I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
mask before removing the kernel module and after doing rmmod+insmod
and didn't find any difference.

>  and if the machine was under incoming traffic load when the
> driver loaded, CPU0 would start to poll among all the different netdev
> queues, all on CPU0.
>
> The above then leads to the condition that the device is stuck polling
> even if the affinity gets updated from user space, and the polling will
> continue until traffic stops.
>
>> The problem with the commit is that when we overwrite the affinity mask
>> based on the hinting mask, we completely ignore the default SMP affinity
>> mask. If we do want to overwrite the affinity based on the hint mask, we
>> should at least consider the default SMP affinity.

For the issue where the IRQs don't follow the default_smp_affinity mask
because of this patch, the following are the steps by which it can be easily
reproduced with the latest linux kernel:

# Kernel
5.12.0-rc6+

# Other parameters in the cmdline
isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
rcu_nocbs=2-39,44-79

# cat /proc/irq/default_smp_affinity
0000,00000f00,00000003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]

# Create VFs and check IRQ affinity mask

/proc/irq/1423/iavf-ens1f1v3-TxRx-3
3
/proc/irq/1424/iavf-0000:3b:0b.0:mbx
0
40
42
/proc/irq/1425/iavf-ens1f1v8-TxRx-0
0
/proc/irq/1426/iavf-ens1f1v8-TxRx-1
1
/proc/irq/1427/iavf-ens1f1v8-TxRx-2
2
/proc/irq/1428/iavf-ens1f1v8-TxRx-3
3
...
/proc/irq/1475/iavf-ens1f1v15-TxRx-0
0
/proc/irq/1476/iavf-ens1f1v15-TxRx-1
1
/proc/irq/1477/iavf-ens1f1v15-TxRx-2
2
/proc/irq/1478/iavf-ens1f1v15-TxRx-3
3
/proc/irq/1479/iavf-0000:3b:0a.0:mbx
0
40
42
...
/proc/irq/240/iavf-ens1f1v3-TxRx-0
0
/proc/irq/248/iavf-ens1f1v3-TxRx-1
1
/proc/irq/249/iavf-ens1f1v3-TxRx-2
2


Trace dump:
----------
..
11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
            irq=1478 is_managed=0 can_reserve=1 reserve=0
11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
            irq=1478 vector=65 reserved=1 ret=0
11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
            irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
            irq=1478 vector=65 cpu=42 apicdest=0x00000200
11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
            irq=1478 vector=46 reserved=0 ret=0

11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
            irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42

11551364:  NetworkManager-1734  [040]  8167.465770: vector_config:      
            irq=1478 vector=46 cpu=3 apicdest=0x00040100
..

As we can see in the above trace, the initial affinity for IRQ 1478 was
correctly set as per the default_smp_affinity mask, which includes CPU 42;
however, later on, it is updated with CPU 3, which is returned from
cpumask_local_spread().

> Maybe the right thing is to fix which CPUs are passed in as the valid
> mask, or make sure the kernel cross checks that what the driver asks
> for is a "valid CPU"?
>

Sure, if we can still reproduce the problem that your patch was fixing, then
maybe we can consider adding a new API like cpumask_local_spread_irq() in
which we should consider the default_smp_affinity mask as well before returning
the CPU.

-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-07 15:18                               ` Nitesh Narayan Lal
@ 2021-04-08 18:49                                 ` Nitesh Narayan Lal
  2021-04-14 16:11                                 ` Jesse Brandeburg
  1 sibling, 0 replies; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-04-08 18:49 UTC (permalink / raw)
  To: Jesse Brandeburg, Marcelo Tosatti, Thomas Gleixner, frederic
  Cc: Robin Murphy, linux-kernel, linux-api, juri.lelli, abelits,
	bhelgaas, linux-pci, rostedt, mingo, peterz, davem, akpm, sfr,
	stephen, rppt, jinyuqi, zhangshaokun, Network Development,
	sassmann, Yang, Lihong


On 4/7/21 11:18 AM, Nitesh Narayan Lal wrote:
> On 4/6/21 1:22 PM, Jesse Brandeburg wrote:
>> Continuing a thread from a bit ago...
>>
>> Nitesh Narayan Lal wrote:
>>
>>>> After a little more digging, I found out why the cpumask_local_spread change
>>>> affects the general/initial smp_affinity for certain device IRQs.
>>>>
>>>> After the introduction of the commit:
>>>>
>>>>     e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>>>>
>>> Continuing the conversation about the above commit and adding Jesse.
>>> I was trying to understand the problem that the commit message describes:
>>> "The default behavior of the kernel is somewhat undesirable as all
>>> requested interrupts end up on CPU0 after registration." I have also been
>>> trying to reproduce this behavior without the patch, but I failed to do
>>> so; maybe I am missing something here.
>>>
>>> @Jesse Can you please explain? FWIU IRQ affinity should be decided based on
>>> the default affinity mask.
> Thanks, Jesse for responding.
>
>> The original issue as seen was that if you rmmod/insmod a driver
>> *without* irqbalance running, the default irq mask is -1, which means
>> any CPU. The older kernels (this issue was patched in 2014) used to use
>> that affinity mask, but the value programmed into all the interrupt
>> registers "actual affinity" would end up delivering all interrupts to
>> CPU0,
> So does that mean the affinity mask for the IRQs was different wrt where
> the IRQs were actually delivered?
>> Or was the affinity mask itself for the IRQs changed to 0 instead of -1
>> after rmmod/insmod?
>
> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
> mask before removing the kernel module and after doing rmmod+insmod
> and didn't find any difference.
>
>>  and if the machine was under incoming traffic load when the
>> driver loaded, CPU0 would start to poll among all the different netdev
>> queues, all on CPU0.
>>
>> The above then leads to the condition that the device is stuck polling
>> even if the affinity gets updated from user space, and the polling will
>> continue until traffic stops.
>>
>>> The problem with the commit is that when we overwrite the affinity mask
>>> based on the hinting mask, we completely ignore the default SMP affinity
>>> mask. If we do want to overwrite the affinity based on the hint mask, we
>>> should at least consider the default SMP affinity.
> For the issue where the IRQs don't follow the default_smp_affinity mask
> because of this patch, the following are the steps by which it can be easily
> reproduced with the latest linux kernel:
>
> # Kernel
> 5.12.0-rc6+
>
> # Other parameters in the cmdline
> isolcpus=2-39,44-79 nohz=on nohz_full=2-39,44-79
> rcu_nocbs=2-39,44-79
>
> # cat /proc/irq/default_smp_affinity
> 0000,00000f00,00000003 [Corresponds to HK CPUs - 0, 1, 40, 41, 42 and 43]
>
> # Create VFs and check IRQ affinity mask
>
> /proc/irq/1423/iavf-ens1f1v3-TxRx-3
> 3
> /proc/irq/1424/iavf-0000:3b:0b.0:mbx
> 0
> 40
> 42
> /proc/irq/1425/iavf-ens1f1v8-TxRx-0
> 0
> /proc/irq/1426/iavf-ens1f1v8-TxRx-1
> 1
> /proc/irq/1427/iavf-ens1f1v8-TxRx-2
> 2
> /proc/irq/1428/iavf-ens1f1v8-TxRx-3
> 3
> ...
> /proc/irq/1475/iavf-ens1f1v15-TxRx-0
> 0
> /proc/irq/1476/iavf-ens1f1v15-TxRx-1
> 1
> /proc/irq/1477/iavf-ens1f1v15-TxRx-2
> 2
> /proc/irq/1478/iavf-ens1f1v15-TxRx-3
> 3
> /proc/irq/1479/iavf-0000:3b:0a.0:mbx
> 0
> 40
> 42
> ...
> /proc/irq/240/iavf-ens1f1v3-TxRx-0
> 0
> /proc/irq/248/iavf-ens1f1v3-TxRx-1
> 1
> /proc/irq/249/iavf-ens1f1v3-TxRx-2
> 2
>
>
> Trace dump:
> ----------
> ..
> 11551082:  NetworkManager-1734  [040]  8167.465719: vector_activate:    
>             irq=1478 is_managed=0 can_reserve=1 reserve=0
> 11551090:  NetworkManager-1734  [040]  8167.465720: vector_alloc:
>             irq=1478 vector=65 reserved=1 ret=0
> 11551093:  NetworkManager-1734  [040]  8167.465721: vector_update:      
>             irq=1478 vector=65 cpu=42 prev_vector=0 prev_cpu=0
> 11551097:  NetworkManager-1734  [040]  8167.465721: vector_config:      
>             irq=1478 vector=65 cpu=42 apicdest=0x00000200
> 11551357:  NetworkManager-1734  [040]  8167.465768: vector_alloc:        
>             irq=1478 vector=46 reserved=0 ret=0
>
> 11551360:  NetworkManager-1734  [040]  8167.465769: vector_update:      
>             irq=1478 vector=46 cpu=3 prev_vector=65 prev_cpu=42
>
> 11551364:  NetworkManager-1734  [040]  8167.465770: vector_config:      
>             irq=1478 vector=46 cpu=3 apicdest=0x00040100
> ..
>
> As we can see in the above trace, the initial affinity for IRQ 1478 was
> correctly set as per the default_smp_affinity mask, which includes CPU 42;
> however, later on, it is updated with CPU 3, which is returned from
> cpumask_local_spread().
>
>> Maybe the right thing is to fix which CPUs are passed in as the valid
>> mask, or make sure the kernel cross checks that what the driver asks
>> for is a "valid CPU"?
>>
> Sure, if we can still reproduce the problem that your patch was fixing then
> maybe we can consider adding a new API like cpumask_local_spread_irq in
> which we should consider the default_smp_affinity mask as well before returning
> the CPU.
>

Didn't realize that netdev ml was not included, so adding that.

-- 
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-07 15:18                               ` Nitesh Narayan Lal
  2021-04-08 18:49                                 ` Nitesh Narayan Lal
@ 2021-04-14 16:11                                 ` Jesse Brandeburg
  2021-04-15 22:11                                   ` Nitesh Narayan Lal
  1 sibling, 1 reply; 52+ messages in thread
From: Jesse Brandeburg @ 2021-04-14 16:11 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Marcelo Tosatti, Thomas Gleixner, frederic, Robin Murphy,
	linux-kernel, linux-api, juri.lelli, abelits, bhelgaas,
	linux-pci, rostedt, mingo, peterz, davem, akpm, sfr, stephen,
	rppt, jinyuqi, zhangshaokun, netdev, chris.friesen

Nitesh Narayan Lal wrote:

> > The original issue as seen was that if you rmmod/insmod a driver
> > *without* irqbalance running, the default irq mask is -1, which means
> > any CPU. The older kernels (this issue was patched in 2014) used to use
> > that affinity mask, but the value programmed into all the interrupt
> > registers "actual affinity" would end up delivering all interrupts to
> > CPU0,
> 
> So does that mean the affinity mask for the IRQs was different wrt where
> the IRQs were actually delivered?
> Or was the affinity mask itself for the IRQs changed to 0 instead of -1
> after rmmod/insmod?

The smp_affinity was 0xfff, and the kernel chooses which CPU to
place the interrupt on, among any of the bits set.

 
> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
> mask before removing the kernel module and after doing rmmod+insmod
> and didn't find any difference.

with the patch in question removed? Sorry, I'm confused what you tried.

> 
> >  and if the machine was under incoming traffic load when the
> > driver loaded, CPU0 would start to poll among all the different netdev
> > queues, all on CPU0.
> >
> > The above then leads to the condition that the device is stuck polling
> > even if the affinity gets updated from user space, and the polling will
> > continue until traffic stops.
> >
> >> The problem with the commit is that when we overwrite the affinity mask
> >> based on the hinting mask, we completely ignore the default SMP affinity
> >> mask. If we do want to overwrite the affinity based on the hint mask, we
> >> should at least consider the default SMP affinity.
> 
> For the issue where the IRQs don't follow the default_smp_affinity mask
> because of this patch, the following are the steps by which it can be easily
> reproduced with the latest linux kernel:
> 
> # Kernel
> 5.12.0-rc6+

<snip>

> As we can see in the above trace, the initial affinity for IRQ 1478 was
> correctly set as per the default_smp_affinity mask, which includes CPU 42;
> however, later on, it is updated with CPU 3, which is returned from
> cpumask_local_spread().
> 
> > Maybe the right thing is to fix which CPUs are passed in as the valid
> > mask, or make sure the kernel cross checks that what the driver asks
> > for is a "valid CPU"?
> >
> 
> Sure, if we can still reproduce the problem that your patch was fixing then
> maybe we can consider adding a new API like cpumask_local_spread_irq in
> which we should consider the default_smp_affinity mask as well before returning
> the CPU.

I'm sure I don't have a reproducer of the original problem any more; it
is lost somewhere 8 years ago. I'd like to be able to repro the original
issue, but I can't.

Your description of the problem makes it obvious there is an issue. It
appears as if cpumask_local_spread() is the wrong function to use here.
If you have any suggestions please let me know.

We had one other report of this problem as well (I'm not sure if it's
the same as your report)
https://lkml.org/lkml/2021/3/28/206
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210125/023120.html


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-14 16:11                                 ` Jesse Brandeburg
@ 2021-04-15 22:11                                   ` Nitesh Narayan Lal
  2021-04-29 21:44                                     ` Nitesh Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Narayan Lal @ 2021-04-15 22:11 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Marcelo Tosatti, Thomas Gleixner, frederic, Robin Murphy,
	linux-kernel, linux-api, juri.lelli, abelits, bhelgaas,
	linux-pci, rostedt, mingo, peterz, davem, akpm, sfr, stephen,
	rppt, jinyuqi, zhangshaokun, netdev, chris.friesen


On 4/14/21 12:11 PM, Jesse Brandeburg wrote:
> Nitesh Narayan Lal wrote:
>
>>> The original issue as seen was that if you rmmod/insmod a driver
>>> *without* irqbalance running, the default irq mask is -1, which means
>>> any CPU. The older kernels (this issue was patched in 2014) used to use
>>> that affinity mask, but the value programmed into all the interrupt
>>> registers "actual affinity" would end up delivering all interrupts to
>>> CPU0,
>> So does that mean the affinity mask for the IRQs was different wrt where
>> the IRQs were actually delivered?
>> Or was the affinity mask itself for the IRQs changed to 0 instead of -1
>> after rmmod/insmod?
> The smp_affinity was 0xfff, and the kernel chooses which CPU to
> place the interrupt on, among any of the bits set.


I think what you are referring to here is the effective affinity mask.
From one of Thomas's commit messages that you pointed me to:

"The affinity mask is either the system-wide default or set by userspace,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs."

I was looking into the code changes around IRQs, and there has been a major
rework from Thomas in 2017. I recently tried testing the kernel from just
before those patches got merged.

Specifically on top of
05161b9cbe5:     x86/irq: Get rid of the 'first_system_vector' indirection
                 bogosity

On the box where I tested this, I was able to see the effective affinity
being set to 0 (not the SMP affinity) for several network device IRQs.

I think the reason for it is the usage of "cpumask_first_and(mask,
cpu_online_mask)" in __assign_irq_vector().

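To illustrate (a toy snippet, not the actual allocator code):
cpumask_first_and() just returns the lowest-numbered CPU present in both
masks, so with a wide affinity mask it degenerates to CPU0 whenever CPU0
is online:

        cpumask_var_t mask;

        if (alloc_cpumask_var(&mask, GFP_KERNEL)) {
                cpumask_setall(mask);   /* affinity = all CPUs, i.e. -1 */
                /* Lowest CPU set in both masks -> almost always CPU0 */
                pr_info("target cpu: %u\n",
                        cpumask_first_and(mask, cpu_online_mask));
                free_cpumask_var(mask);
        }
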
But with the latest kernel, this has been replaced, and that's why I didn't
see the effective affinity being set to only 0 for all of the device IRQs.

Having said that, I am still not sure if I was able to mimic what you have
seen in the past. But it looked similar to what you were explaining.
What do you think?



>
>  
>> I did a quick test on top of 5.12.0-rc6 by comparing the i40e IRQ affinity
>> mask before removing the kernel module and after doing rmmod+insmod
>> and didn't find any difference.
> with the patch in question removed? Sorry, I'm confused what you tried.

Yeah, but I was only referring to the SMP affinity mask. Please see the more
up-to-date testing results above.

>>>  and if the machine was under incoming traffic load when the
>>> driver loaded, CPU0 would start to poll among all the different netdev
>>> queues, all on CPU0.
>>>
>>> The above then leads to the condition that the device is stuck polling
>>> even if the affinity gets updated from user space, and the polling will
>>> continue until traffic stops.
>>>

[...]

>> As we can see in the above trace, the initial affinity for IRQ 1478 was
>> correctly set as per the default_smp_affinity mask, which includes CPU 42;
>> however, later on, it is updated with CPU 3, which is returned from
>> cpumask_local_spread().
>>
>>> Maybe the right thing is to fix which CPUs are passed in as the valid
>>> mask, or make sure the kernel cross checks that what the driver asks
>>> for is a "valid CPU"?
>>>
>> Sure, if we can still reproduce the problem that your patch was fixing then
>> maybe we can consider adding a new API like cpumask_local_spread_irq in
>> which we should consider the default_smp_affinity mask as well before returning
>> the CPU.
> I'm sure I don't have a reproducer of the original problem any more, it
> is lost somewhere 8 years ago. I'd like to be able to repro the original
> issue, but I can't.
>
> Your description of the problem makes it obvious there is an issue. It
> appears as if cpumask_local_spread() is the wrong function to use here.
> If you have any suggestions please let me know.
>
> We had one other report of this problem as well (I'm not sure if it's
> the same as your report)
> https://lkml.org/lkml/2021/3/28/206
> https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210125/023120.html


How about we introduce a new API just for IRQ spreading,
cpumask_local_spread_irq(), and then utilize the default_smp_affinity mask
in it before returning the CPU?

Although I think the right way to deal with this would be to fix it at the
source, that is, where the CPU mask is assigned to an IRQ for the very
first time.
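
A rough, untested sketch of the first idea (cpumask_local_spread_irq() is a
made-up name here, and this ignores NUMA_NO_NODE and hotplug corner cases):

unsigned int cpumask_local_spread_irq(unsigned int i, int node)
{
        unsigned int cpu, n = 0;

        /* Walk the node-local CPUs, skipping the ones that the default
         * IRQ affinity mask excludes. */
        for_each_cpu_and(cpu, cpumask_of_node(node), irq_default_affinity)
                if (n++ == i)
                        return cpu;

        /* Empty intersection (or i out of range): fall back. */
        return cpumask_local_spread(i, node);
}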


-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-15 22:11                                   ` Nitesh Narayan Lal
@ 2021-04-29 21:44                                     ` Nitesh Lal
  2021-04-30  1:48                                       ` Jesse Brandeburg
  2021-04-30  7:10                                       ` Thomas Gleixner
  0 siblings, 2 replies; 52+ messages in thread
From: Nitesh Lal @ 2021-04-29 21:44 UTC (permalink / raw)
  To: Jesse Brandeburg, Thomas Gleixner, frederic, juri.lelli,
	Marcelo Tosatti, abelits
  Cc: Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

On Thu, Apr 15, 2021 at 6:11 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 4/14/21 12:11 PM, Jesse Brandeburg wrote:
> > Nitesh Narayan Lal wrote:
> >
> >>> The original issue as seen was that if you rmmod/insmod a driver
> >>> *without* irqbalance running, the default irq mask is -1, which means
> >>> any CPU. The older kernels (this issue was patched in 2014) used to use
> >>> that affinity mask, but the value programmed into all the interrupt
> >>> registers "actual affinity" would end up delivering all interrupts to
> >>> CPU0,
> >> So does that mean the affinity mask for the IRQs was different wrt where
> >> the IRQs were actually delivered?
> >> Or was the affinity mask itself for the IRQs changed to 0 instead of -1
> >> after rmmod/insmod?
> > The smp_affinity was 0xfff, and the kernel chooses which CPU to
> > place the interrupt on, among any of the bits set.
>
>

<snip>

> >
> > Your description of the problem makes it obvious there is an issue. It
> > appears as if cpumask_local_spread() is the wrong function to use here.
> > If you have any suggestions please let me know.
> >
> > We had one other report of this problem as well (I'm not sure if it's
> > the same as your report)
> > https://lkml.org/lkml/2021/3/28/206
> > https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210125/023120.html
>
>

So, to understand further what the problem was with the older kernel based
on Jesse's description, and whether it is still there, I did some more
digging. Following are some of the findings (kindly correct me if
there is a gap in my understanding):

Part-1: Why there was a problem with the older kernel?
------
With a kernel built on top of the tag v4.0.0 (with Jesse's patch reverted
and irqbalance disabled), if we observe the /proc/irq entries for ixgbe device IRQs,
then there are two things to note:

# No separate effective affinity (it was only introduced as part of
  the 2017 IRQ re-work)
  $ ls /proc/irq/86/
    affinity_hint  node  p2p1  smp_affinity  smp_affinity_list  spurious

# Multiple CPUs are set in the smp_affinity_list and the first CPU is CPU0:

  $ /proc/irq/60/p2p1-TxRx-0
    0,2,4,6,8,10,12,14,16,18,20,22

  $ /proc/irq/61/p2p1-TxRx-1
    0,2,4,6,8,10,12,14,16,18,20,22

  $ /proc/irq/62/p2p1-TxRx-2
    0,2,4,6,8,10,12,14,16,18,20,22
     ...


Now,  if we read the commit message from Thomas's patch that was part of
this IRQ re-work:
fdba46ff:  x86/apic: Get rid of multi CPU affinity
"
..
2) Experiments have shown that the benefit of multi CPU affinity is close
   to zero and in some tests even worse than setting the affinity to a single
   CPU.

The reason for this is that the delivery targets the APIC with the lowest
ID first and only if that APIC is busy (servicing an interrupt, i.e. ISR is
not empty) it hands it over to the next APIC. In the conducted tests the
vast majority of interrupts ends up on the APIC with the lowest ID anyway,
so there is no natural spreading of the interrupts possible.
"

I think this explains why, even if we have multiple CPUs in the SMP affinity
mask, the interrupts may only land on CPU0.

With Jesse's patch in the kernel, the initial affinity mask that included
multiple CPUs is overwritten with a single CPU. This CPU was previously
selected based on vector_index; later on, this was replaced with logic
where the CPU is fetched from cpumask_local_spread(). Hence, in this
case, the interrupts will be spread across different CPUs.
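
The driver-side pattern in question looks roughly like this (a generic,
illustrative sketch rather than any specific driver's code; queue_irq[] and
num_queues are placeholders):

        int i;

        /* Per-queue setup: with e2e64a932 applied, the hint below also
         * programs the initial affinity of each queue vector. */
        for (i = 0; i < num_queues; i++) {
                int cpu = cpumask_local_spread(i, numa_node);

                irq_set_affinity_hint(queue_irq[i], cpumask_of(cpu));
        }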

# listing the IRQ smp_affinity_list on the v4.0.0 kernel with Jesse's patch
  $ /proc/irq/60/p2p1-TxRx-0
    0
  $ /proc/irq/61/p2p1-TxRx-1
    1
  $ /proc/irq/62/p2p1-TxRx-2
    2
    ...
  $ /proc/irq/65/p2p1-TxRx-5
    5
  $ /proc/irq/66/p2p1-TxRx-6
    6
   ...



Part-2: Why this may not be a problem anymore?
------
With the latest kernel, if we check the effective_affinity_list for i40e
IRQs without irqbalance and with Jesse's patch reverted, it is already set
to a single CPU that is not always 0. This CPU is retrieved based on the
vector allocation count, i.e., we get the CPU that has the lowest vector
allocation count.

  $ /proc/irq/100/i40e-eno1-TxRx-5
    28
  $ /proc/irq/101/i40e-eno1-TxRx-6
    30
  $ /proc/irq/102/i40e-eno1-TxRx-7
    32
    …
  $ /proc/irq/121/i40e-eno1-TxRx-18
    16
  $ /proc/irq/122/i40e-eno1-TxRx-19
    18
    ..

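The selection policy can be pictured with a toy sketch like the one below
(the real logic lives in the kernel's vector matrix allocator; the names
here are simplified and made up):

/* Toy model: pick the CPU with the fewest allocated vectors. */
static unsigned int pick_least_loaded_cpu(const struct cpumask *searchmsk,
                                          const unsigned int *vec_count)
{
        unsigned int cpu, best_cpu = 0, best = UINT_MAX;

        for_each_cpu(cpu, searchmsk) {
                if (vec_count[cpu] < best) {
                        best = vec_count[cpu];
                        best_cpu = cpu;
                }
        }
        return best_cpu;
}
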
@Jesse do you think the Part-1 findings explain the behavior that you have
observed in the past?

Also, let me know if there are any suggestions or experiments to try here.


--
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-29 21:44                                     ` Nitesh Lal
@ 2021-04-30  1:48                                       ` Jesse Brandeburg
  2021-04-30 13:10                                         ` Nitesh Lal
  2021-04-30  7:10                                       ` Thomas Gleixner
  1 sibling, 1 reply; 52+ messages in thread
From: Jesse Brandeburg @ 2021-04-30  1:48 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: Thomas Gleixner, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

Nitesh Lal wrote:

> @Jesse do you think the Part-1 findings explain the behavior that you have
> observed in the past?
> 
> Also, let me know if there are any suggestions or experiments to try here.

Wow Nitesh, nice work! That's quite a bit of spelunking you had to do
there!

Your results showing the ranged-affinity issues on the older kernels are
consistent with what I remember from that time, and with the original
problem.

I'm glad to see that a) Thomas fixed the kernel to even do better than
ranged affinity masks, and that b) if you revert my patch, the new
behavior is better and still maintains the fix from a).

For me this explains the whole picture and makes me feel comfortable
with the patch that reverts the initial affinity mask (that also
introduces a subtle bug with the reserved CPUs that I believe you've
noted already).

Thanks for this work!
Jesse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-29 21:44                                     ` Nitesh Lal
  2021-04-30  1:48                                       ` Jesse Brandeburg
@ 2021-04-30  7:10                                       ` Thomas Gleixner
  2021-04-30 16:14                                         ` Nitesh Lal
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2021-04-30  7:10 UTC (permalink / raw)
  To: Nitesh Lal, Jesse Brandeburg, frederic, juri.lelli,
	Marcelo Tosatti, abelits
  Cc: Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

Nitesh,

On Thu, Apr 29 2021 at 17:44, Nitesh Lal wrote:

First of all: Nice analysis, well done!

> So to understand further what the problem was with the older kernel based
> on Jesse's description and whether it is still there I did some more
> digging. Following are some of the findings (kindly correct me if
> there is a gap in my understanding):
>
> Part-1: Why there was a problem with the older kernel?
> ------
> With a kernel built on top of the tag v4.0.0 (with Jesse's patch reverted
> and irqbalance disabled), if we observe the /proc/irq entries for ixgbe device IRQs,
> then there are two things to note:
>
> # No separate effective affinity (it was only introduced as part of
>   the 2017 IRQ re-work)
>   $ ls /proc/irq/86/
>     affinity_hint  node  p2p1  smp_affinity  smp_affinity_list  spurious
>
> # Multiple CPUs are set in the smp_affinity_list and the first CPU is CPU0:
>
>   $ /proc/irq/60/p2p1-TxRx-0
>     0,2,4,6,8,10,12,14,16,18,20,22
>
>   $ /proc/irq/61/p2p1-TxRx-1
>     0,2,4,6,8,10,12,14,16,18,20,22
>
>   $ /proc/irq/62/p2p1-TxRx-2
>     0,2,4,6,8,10,12,14,16,18,20,22
>      ...
>
>
> Now,  if we read the commit message from Thomas's patch that was part of
> this IRQ re-work:
> fdba46ff:  x86/apic: Get rid of multi CPU affinity
> "
> ..
> 2) Experiments have shown that the benefit of multi CPU affinity is close
>    to zero and in some tests even worse than setting the affinity to a single
>    CPU.
>
> The reason for this is that the delivery targets the APIC with the lowest
> ID first and only if that APIC is busy (servicing an interrupt, i.e. ISR is
> not empty) it hands it over to the next APIC. In the conducted tests the
> vast majority of interrupts ends up on the APIC with the lowest ID anyway,
> so there is no natural spreading of the interrupts possible.
> "
>
> I think this explains why even if we have multiple CPUs in the SMP affinity
> mask the interrupts may only land on CPU0.

There are two issues in the pre rework vector management:

  1) The allocation logic itself which preferred lower numbered CPUs and
     did not try to spread out the vectors across CPUs. This was pretty
     much true for any APIC addressing mode.

  2) The multi CPU affinity support if supported by the APIC
     mode. That's restricted to logical APIC addressing mode. That is
     available for non X2APIC up to 8 CPUs and with X2APIC it requires
     to be in cluster mode.
     
     All other addressing modes had a single CPU target selected under
     the hood which due to #1 was ending up on CPU0 most of the time at
     least up to the point where it still had vectors available.

     Also logical addressing mode with multiple target CPUs was subject
     to #1 and due to the delivery logic the lowest numbered CPU (APIC)
     was where most interrupts ended up.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-30  1:48                                       ` Jesse Brandeburg
@ 2021-04-30 13:10                                         ` Nitesh Lal
  0 siblings, 0 replies; 52+ messages in thread
From: Nitesh Lal @ 2021-04-30 13:10 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Thomas Gleixner, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

On Thu, Apr 29, 2021 at 9:48 PM Jesse Brandeburg
<jesse.brandeburg@intel.com> wrote:
>
> Nitesh Lal wrote:
>
> > @Jesse do you think the Part-1 findings explain the behavior that you have
> > observed in the past?
> >
> > Also, let me know if there are any suggestions or experiments to try here.
>
> Wow Nitesh, nice work! That's quite a bit of spelunking you had to do
> there!
>
> Your results showing the ranged-affinity issues on the older kernels are
> consistent with what I remember from that time, and with the original
> problem.

That's nice.

>
> I'm glad to see that a) Thomas fixed the kernel to even do better than
> ranged affinity masks, and that b) if you revert my patch, the new
> behavior is better and still maintains the fix from a).

Right, the interrupts are naturally spread now.

>
> For me this explains the whole picture and makes me feel comfortable
> with the patch that reverts the initial affinity mask (that also
> introduces a subtle bug with the reserved CPUs that I believe you've
> noted already).
>

Thank you for confirming!

--
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-30  7:10                                       ` Thomas Gleixner
@ 2021-04-30 16:14                                         ` Nitesh Lal
  2021-04-30 18:21                                           ` Thomas Gleixner
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Lal @ 2021-04-30 16:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesse Brandeburg, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

On Fri, Apr 30, 2021 at 3:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Nitesh,
>
> On Thu, Apr 29 2021 at 17:44, Nitesh Lal wrote:
>
> First of all: Nice analysis, well done!

Thanks, Thomas.

>
> > So to understand further what the problem was with the older kernel based
> > on Jesse's description and whether it is still there I did some more
> > digging. Following are some of the findings (kindly correct me if
> > there is a gap in my understanding):

<snip>

> >
> > I think this explains why even if we have multiple CPUs in the SMP affinity
> > mask the interrupts may only land on CPU0.
>
> There are two issues in the pre rework vector management:
>
>   1) The allocation logic itself which preferred lower numbered CPUs and
> >      did not try to spread out the vectors across CPUs. This was pretty
>      much true for any APIC addressing mode.
>
>   2) The multi CPU affinity support if supported by the APIC
>      mode. That's restricted to logical APIC addressing mode. That is
>      available for non X2APIC up to 8 CPUs and with X2APIC it requires
>      to be in cluster mode.
>
>      All other addressing modes had a single CPU target selected under
>      the hood which due to #1 was ending up on CPU0 most of the time at
>      least up to the point where it still had vectors available.
>
>      Also logical addressing mode with multiple target CPUs was subject
>      to #1 and due to the delivery logic the lowest numbered CPU (APIC)
>      was where most interrupts ended up.
>

Right, thank you for confirming.


Based on this analysis and the fact that with your re-work the interrupts
seem to be naturally spread across the CPUs, will it be safe to revert
Jesse's patch

e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()

as it overwrites the previously set IRQ affinity mask for some of the
devices?

IMHO, if we think that this patch is still solving some issue other than
what Jesse has mentioned, then perhaps we should reproduce that and fix it
directly from the request_irq() code path.

-- 
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-30 16:14                                         ` Nitesh Lal
@ 2021-04-30 18:21                                           ` Thomas Gleixner
  2021-04-30 21:07                                             ` Nitesh Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Thomas Gleixner @ 2021-04-30 18:21 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: Jesse Brandeburg, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

Nitesh,

On Fri, Apr 30 2021 at 12:14, Nitesh Lal wrote:
> Based on this analysis and the fact that with your re-work the interrupts
> > seem to be naturally spread across the CPUs, will it be safe to revert
> Jesse's patch
>
> e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
>
> as it overwrites the previously set IRQ affinity mask for some of the
> devices?

That's a good question. My gut feeling says yes.

> IMHO if we think that this patch is still solving some issue other than
> what Jesse has mentioned then perhaps we should reproduce that and fix it
> directly from the request_irq code path.

Makes sense.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-30 18:21                                           ` Thomas Gleixner
@ 2021-04-30 21:07                                             ` Nitesh Lal
  2021-05-01  2:21                                               ` Jesse Brandeburg
  0 siblings, 1 reply; 52+ messages in thread
From: Nitesh Lal @ 2021-04-30 21:07 UTC (permalink / raw)
  To: Thomas Gleixner, Jesse Brandeburg
  Cc: frederic, juri.lelli, Marcelo Tosatti, abelits, Robin Murphy,
	linux-kernel, linux-api, bhelgaas, linux-pci, rostedt, mingo,
	peterz, davem, akpm, sfr, stephen, rppt, jinyuqi, zhangshaokun,
	netdev, chris.friesen

On Fri, Apr 30, 2021 at 2:21 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Nitesh,
>
> On Fri, Apr 30 2021 at 12:14, Nitesh Lal wrote:
> > Based on this analysis and the fact that with your re-work the interrupts
> > seem to be naturally spread across the CPUs, will it be safe to revert
> > Jesse's patch
> >
> > e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
> >
> > as it overwrites the previously set IRQ affinity mask for some of the
> > devices?
>
> That's a good question. My gut feeling says yes.
>

Jesse do you want to send the revert for the patch?

Also, I think it was you who suggested cc'ing
intel-wired-lan ml, as that allows Intel folks to do some initial
testing?
If so, we can do that here (IMHO).

> > IMHO if we think that this patch is still solving some issue other than
> > what Jesse has mentioned then perhaps we should reproduce that and fix it
> > directly from the request_irq code path.
>
> Makes sense.
>
> Thanks,
>
>         tglx
>


-- 
Thanks
Nitesh


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-04-30 21:07                                             ` Nitesh Lal
@ 2021-05-01  2:21                                               ` Jesse Brandeburg
  2021-05-03 13:15                                                 ` Nitesh Lal
  0 siblings, 1 reply; 52+ messages in thread
From: Jesse Brandeburg @ 2021-05-01  2:21 UTC (permalink / raw)
  To: Nitesh Lal
  Cc: Thomas Gleixner, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

Nitesh Lal wrote:

> On Fri, Apr 30, 2021 at 2:21 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Nitesh,
> >
> > On Fri, Apr 30 2021 at 12:14, Nitesh Lal wrote:
> > > Based on this analysis and the fact that with your re-work the interrupts
> > > seem to be naturally spread across the CPUs, will it be safe to revert
> > > Jesse's patch
> > >
> > > e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
> > >
> > > as it overwrites the previously set IRQ affinity mask for some of the
> > > devices?
> >
> > That's a good question. My gut feeling says yes.
> >
> 
> Jesse do you want to send the revert for the patch?
> 
> Also, I think it was you who suggested cc'ing
> > intel-wired-lan ml, as that allows Intel folks to do some initial
> testing?
> If so, we can do that here (IMHO).

Patch sent here:
https://lore.kernel.org/lkml/20210501021832.743094-1-jesse.brandeburg@intel.com/T/#u

Any testing appreciated!

Jesse

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs
  2021-05-01  2:21                                               ` Jesse Brandeburg
@ 2021-05-03 13:15                                                 ` Nitesh Lal
  0 siblings, 0 replies; 52+ messages in thread
From: Nitesh Lal @ 2021-05-03 13:15 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Thomas Gleixner, frederic, juri.lelli, Marcelo Tosatti, abelits,
	Robin Murphy, linux-kernel, linux-api, bhelgaas, linux-pci,
	rostedt, mingo, peterz, davem, akpm, sfr, stephen, rppt, jinyuqi,
	zhangshaokun, netdev, chris.friesen

On Fri, Apr 30, 2021 at 10:21 PM Jesse Brandeburg
<jesse.brandeburg@intel.com> wrote:
>
> Nitesh Lal wrote:
>
> > On Fri, Apr 30, 2021 at 2:21 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Nitesh,
> > >
> > > On Fri, Apr 30 2021 at 12:14, Nitesh Lal wrote:
> > > > Based on this analysis, and the fact that with your re-work the interrupts
> > > > seem to be naturally spread across the CPUs, would it be safe to revert
> > > > Jesse's patch
> > > >
> > > > e2e64a932 genirq: Set initial affinity in irq_set_affinity_hint()
> > > >
> > > > as it overwrites the previously set IRQ affinity mask for some of the
> > > > devices?
> > >
> > > That's a good question. My gut feeling says yes.
> > >
> >
> > Jesse, do you want to send the revert for the patch?
> >
> > Also, I think it was you who suggested cc'ing the intel-wired-lan ML,
> > as that allows Intel folks to do some initial testing. If so, we can
> > do that here (IMHO).
>
> Patch sent here:
> https://lore.kernel.org/lkml/20210501021832.743094-1-jesse.brandeburg@intel.com/T/#u
>
> Any testing appreciated!
>
> Jesse
>

Thank you!

--
Nitesh



Thread overview: 52+ messages
2020-06-25 22:34 [PATCH v4 0/3] Preventing job distribution to isolated CPUs Nitesh Narayan Lal
2020-06-25 22:34 ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs Nitesh Narayan Lal
2020-06-29 16:11   ` Nitesh Narayan Lal
2020-07-01  0:32     ` Andrew Morton
2020-07-01  0:47       ` Nitesh Narayan Lal
2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
2021-01-27 11:57   ` [Patch v4 1/3] " Robin Murphy
2021-01-27 12:19     ` Marcelo Tosatti
2021-01-27 12:36       ` Robin Murphy
2021-01-27 13:09         ` Marcelo Tosatti
2021-01-27 13:49           ` Robin Murphy
2021-01-27 14:16           ` Nitesh Narayan Lal
2021-01-28 15:56           ` Thomas Gleixner
2021-01-28 16:33             ` Marcelo Tosatti
     [not found]             ` <02ac9d85-7ddd-96da-1252-4663feea7c9f@marvell.com>
2021-02-01 17:50               ` [EXT] " Marcelo Tosatti
2021-01-28 16:02       ` Thomas Gleixner
2021-01-28 16:59         ` Marcelo Tosatti
2021-01-28 17:35           ` Nitesh Narayan Lal
2021-01-28 20:01           ` Thomas Gleixner
     [not found]             ` <d2a4dc97-a9ed-e0e7-3b9c-c56ae46f6608@redhat.com>
     [not found]               ` <20210129142356.GB40876@fuller.cnet>
2021-01-29 17:34                 ` [EXT] " Alex Belits
     [not found]                 ` <18584612-868c-0f88-5de2-dc93c8638816@redhat.com>
2021-02-05 19:56                   ` Thomas Gleixner
2021-02-04 18:15             ` Marcelo Tosatti
2021-02-04 18:47               ` Nitesh Narayan Lal
2021-02-04 19:06                 ` Marcelo Tosatti
2021-02-04 19:17                   ` Nitesh Narayan Lal
2021-02-05 22:23                     ` Thomas Gleixner
2021-02-05 22:26                       ` Thomas Gleixner
2021-02-05 23:02                       ` [tip: sched/urgent] Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs" tip-bot2 for Thomas Gleixner
2021-02-07  0:43                       ` [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs Nitesh Narayan Lal
2021-02-11 15:55                         ` Nitesh Narayan Lal
2021-03-04 18:15                           ` Nitesh Narayan Lal
     [not found]                             ` <faa8d84e-db67-7fbe-891e-f4987f106b20@marvell.com>
2021-03-04 23:23                               ` [EXT] " Nitesh Narayan Lal
2021-04-06 17:22                             ` Jesse Brandeburg
2021-04-07 15:18                               ` Nitesh Narayan Lal
2021-04-08 18:49                                 ` Nitesh Narayan Lal
2021-04-14 16:11                                 ` Jesse Brandeburg
2021-04-15 22:11                                   ` Nitesh Narayan Lal
2021-04-29 21:44                                     ` Nitesh Lal
2021-04-30  1:48                                       ` Jesse Brandeburg
2021-04-30 13:10                                         ` Nitesh Lal
2021-04-30  7:10                                       ` Thomas Gleixner
2021-04-30 16:14                                         ` Nitesh Lal
2021-04-30 18:21                                           ` Thomas Gleixner
2021-04-30 21:07                                             ` Nitesh Lal
2021-05-01  2:21                                               ` Jesse Brandeburg
2021-05-03 13:15                                                 ` Nitesh Lal
2020-06-25 22:34 ` [Patch v4 2/3] PCI: Restrict probe functions to housekeeping CPUs Nitesh Narayan Lal
2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
2020-06-25 22:34 ` [Patch v4 3/3] net: Restrict receive packets queuing " Nitesh Narayan Lal
2020-06-26 11:14   ` Peter Zijlstra
2020-06-26 17:20     ` David Miller
2020-07-09  8:45   ` [tip: sched/core] " tip-bot2 for Alex Belits
