linux-kernel.vger.kernel.org archive mirror
* Kernel-managed IRQ affinity (cont)
@ 2019-12-16 19:57 Peter Xu
  2019-12-19  8:28 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2019-12-16 19:57 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Juri Lelli, Ming Lei, Linux Kernel Mailing List, Peter Xu

Hi, Thomas,

(Sorry, I must have lost the previous discussion during an email
 migration, so I'll start a new thread)

This is a continuation of the previous discussion on kernel-managed
IRQ affinity [1].  I think the conclusion at that time was that we
didn't have a usage scenario that justified changing the current
policy [2].  However, I recently noticed that this is probably a
fundamental requirement for some real-time scenarios, even when no
multi-queue device is involved.

My test case is a fairly common realtime guest with 10 vcpus: 0-1 are
housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
virtio-blk device as its boot disk.  With a distribution very close to
the latest upstream, we can observe high latency spikes, probably
caused by the IRQs.

To guarantee realtime responsiveness, we need to make sure the IRQs
are manageable: say, when I run a real-time workload on vcpu9, we
should be able to move all the IRQs away from vcpu9 to the other vcpus
(most probably vcpu0 and vcpu1).  However, with kernel-managed IRQs we
can't echo a new mask into /proc/irq/N/smp_affinity.  Here, vcpu9 gets
IRQ 38 from the virtio-blk device:

  # cat /proc/interrupts | grep -w 38
  38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
  # cat /proc/irq/38/smp_affinity
  3ff
  # cat /proc/irq/38/effective_affinity
  200
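
(For reference: the echo fails because the procfs affinity setter
rejects managed interrupts before any mask is applied -- IIRC the
write bails out with -EIO.  The check looks roughly like the sketch
below, based on my reading of kernel/irq/manage.c around this kernel
version, so the exact helper names may differ:)

  /*
   * Userspace affinity check (sketch): a managed interrupt is never
   * writable via /proc/irq/N/smp_affinity, so the proc write returns
   * an error before ever reaching the irqchip.
   */
  bool irq_can_set_affinity_usr(unsigned int irq)
  {
          struct irq_desc *desc = irq_to_desc(irq);

          return __irq_can_set_affinity(desc) &&
                  !irqd_affinity_is_managed(&desc->irq_data);
  }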

Meanwhile, I don't think there's anything special about VMs, so this
issue should exist on bare-metal hosts as well, as long as the IRQ is
managed in the same way as it is for this virtio-blk device.

As Ming mentioned in previous discussions [3], I think it would at
least be good if the kernel IRQ subsystem could respect "irqaffinity="
when assigning IRQs to cores.  Currently it does not.  What would you
suggest in this case?  Do you think this is a valid user scenario?

Thanks,

[1] https://lkml.org/lkml/2019/3/18/15
[2] https://lkml.org/lkml/2019/3/25/562
[3] https://lkml.org/lkml/2019/3/25/308

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-16 19:57 Kernel-managed IRQ affinity (cont) Peter Xu
@ 2019-12-19  8:28 ` Ming Lei
  2019-12-19 14:32   ` Peter Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2019-12-19  8:28 UTC (permalink / raw)
  To: Peter Xu; +Cc: Thomas Gleixner, Juri Lelli, Ming Lei, Linux Kernel Mailing List

Hi Peter,

On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> Hi, Thomas,
> 
> (Sorry I must have lost the discussion during an email migration, so
>  I'll start with a new one)
> 
> This is a continued discussion of previous one on kernel managed IRQ
> affinity [1].  I think at that time the conclusion is that we don't
> have a usage scenario to change current policy [2].  However recently
> I noticed that it is probably a very fundamental requirement for some
> real-time scenarios, even when there's no multi-queue involved.
> 
> In my test case, it was a very common realtime guest with 10 vcpus,
> 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> virtio-blk device as boot disk.  With a distribution very close to
> latest upstream, we can observe high spikes, probably due to the IRQs.
> 
> To guarantee realtime responsiveness, we need to make sure the IRQs
> will be managable, say, when I run a real-time workload on vcpu9, we
> should be able to move all the IRQs from vcpu9 to the other vcpus
> (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> from the virtio-blk device:
> 
>   # cat /proc/interrupts | grep -w 38
>   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
>   # cat /proc/irq/38/smp_affinity
>   3ff
>   # cat /proc/irq/38/effective_affinity
>   200
> 
> Meanwhile, I don't think there's anything special for VMs, so this
> issue should exist even for hosts as long as the IRQ is managed in the
> same way here as the virtio-blk device.
> 
> As Ming has mentioned in previous discussions [3], I think it would be
> at least good if the kernel IRQ system can respect "irqaffinity=" when
> assigning IRQs to the cores.  Currently it's not.  What would you
> suggest in this case?  Do you think this is a valid user scenario?
> 
> Thanks,
> 
> [1] https://lkml.org/lkml/2019/3/18/15
> [2] https://lkml.org/lkml/2019/3/25/562
> [3] https://lkml.org/lkml/2019/3/25/308

The following patch is supposed to implement that requirement for you;
can you test it by passing 'isolcpus=managed_irq,X-Y'?

With this kind of change, you can't run any IO from an isolated CPU
core; otherwise, unpredictable errors may be triggered, such as an oops
or an IO hang.

A more conservative approach is to only select the effective CPU from
non-isolated CPUs; however, the assigned CPUs may then not be balanced
among the interrupt vectors.  But it is safer, since the system still
works even if someone submits IO from an isolated CPU core.

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 6c8512d3be88..0fbcbacd1b29 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -13,6 +13,7 @@ enum hk_flags {
 	HK_FLAG_TICK		= (1 << 4),
 	HK_FLAG_DOMAIN		= (1 << 5),
 	HK_FLAG_WQ		= (1 << 6),
+	HK_FLAG_MANAGED_IRQ	= (1 << 7),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4d89ad4fae3b..f18f6fa29b73 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/sort.h>
+#include <linux/sched/isolation.h>
 
 static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_vec)
@@ -269,6 +270,8 @@ static int __irq_build_affinity_masks(unsigned int startvec,
 	 */
 	if (numvecs <= nodes) {
 		for_each_node_mask(n, nodemsk) {
+			if (!cpumask_intersects(cpu_mask, node_to_cpumask[n]))
+				continue;
 			cpumask_or(&masks[curvec].mask, &masks[curvec].mask,
 				   node_to_cpumask[n]);
 			if (++curvec == last_affv)
@@ -341,26 +344,29 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 {
 	unsigned int curvec = startvec, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
-	cpumask_var_t nmsk, npresmsk;
+	cpumask_var_t nmsk, cpumsk;
 	int ret = -ENOMEM;
+	const struct cpumask * housekeeping_mask =
+		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
 
 	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
 		return ret;
 
-	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
+	if (!zalloc_cpumask_var(&cpumsk, GFP_KERNEL))
 		goto fail_nmsk;
 
 	node_to_cpumask = alloc_node_to_cpumask();
 	if (!node_to_cpumask)
-		goto fail_npresmsk;
+		goto fail_cpumsk;
 
 	/* Stabilize the cpumasks */
 	get_online_cpus();
 	build_node_to_cpumask(node_to_cpumask);
 
+	cpumask_and(cpumsk, housekeeping_mask, cpu_present_mask);
 	/* Spread on present CPUs starting from affd->pre_vectors */
 	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
+					 node_to_cpumask, cpumsk,
 					 nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
@@ -376,9 +382,10 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 		curvec = firstvec;
 	else
 		curvec = firstvec + nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+	cpumask_andnot(cpumsk, cpu_possible_mask, cpu_present_mask);
+	cpumask_and(cpumsk, cpumsk, housekeeping_mask);
 	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, npresmsk, nmsk,
+					 node_to_cpumask, cpumsk, nmsk,
 					 masks);
 	if (ret >= 0)
 		nr_others = ret;
@@ -391,8 +398,8 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 
 	free_node_to_cpumask(node_to_cpumask);
 
- fail_npresmsk:
-	free_cpumask_var(npresmsk);
+ fail_cpumsk:
+	free_cpumask_var(cpumsk);
 
  fail_nmsk:
 	free_cpumask_var(nmsk);
@@ -405,6 +412,16 @@ static void default_calc_sets(struct irq_affinity *affd, unsigned int affvecs)
 	affd->set_size[0] = affvecs;
 }
 
+static void set_default_affinity(cpumask_var_t mask)
+{
+	const struct cpumask * housekeeping_mask =
+		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
+
+	cpumask_and(mask, housekeeping_mask, irq_default_affinity);
+	if (cpumask_weight(mask) == 0)
+		cpumask_and(mask, housekeeping_mask, cpu_possible_mask);
+}
+
 /**
  * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
  * @nvecs:	The total number of vectors
@@ -452,7 +469,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 
 	/* Fill out vectors at the beginning that don't need affinity */
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
-		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+		set_default_affinity(&masks[curvec].mask);
 
 	/*
 	 * Spread on present CPUs starting from affd->pre_vectors. If we
@@ -478,7 +495,7 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	else
 		curvec = affd->pre_vectors + usedvecs;
 	for (; curvec < nvecs; curvec++)
-		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+		set_default_affinity(&masks[curvec].mask);
 
 	/* Mark the managed interrupts */
 	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9fcb2a695a41..008d6ac2342b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "managed_irq,", 12)) {
+			str += 12;
+			flags |= HK_FLAG_MANAGED_IRQ;
+			continue;
+		}
+
 		pr_warn("isolcpus: Error, unknown flag\n");
 		return 0;
 	}

-- 
Ming


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-19  8:28 ` Ming Lei
@ 2019-12-19 14:32   ` Peter Xu
  2019-12-19 16:11     ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Xu @ 2019-12-19 14:32 UTC (permalink / raw)
  To: Ming Lei; +Cc: Thomas Gleixner, Juri Lelli, Ming Lei, Linux Kernel Mailing List

On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> Hi Peter,

Hi, Ming,

> 
> On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > Hi, Thomas,
> > 
> > (Sorry I must have lost the discussion during an email migration, so
> >  I'll start with a new one)
> > 
> > This is a continued discussion of previous one on kernel managed IRQ
> > affinity [1].  I think at that time the conclusion is that we don't
> > have a usage scenario to change current policy [2].  However recently
> > I noticed that it is probably a very fundamental requirement for some
> > real-time scenarios, even when there's no multi-queue involved.
> > 
> > In my test case, it was a very common realtime guest with 10 vcpus,
> > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> > virtio-blk device as boot disk.  With a distribution very close to
> > latest upstream, we can observe high spikes, probably due to the IRQs.
> > 
> > To guarantee realtime responsiveness, we need to make sure the IRQs
> > will be managable, say, when I run a real-time workload on vcpu9, we
> > should be able to move all the IRQs from vcpu9 to the other vcpus
> > (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> > we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> > from the virtio-blk device:
> > 
> >   # cat /proc/interrupts | grep -w 38
> >   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> >   # cat /proc/irq/38/smp_affinity
> >   3ff
> >   # cat /proc/irq/38/effective_affinity
> >   200
> > 
> > Meanwhile, I don't think there's anything special for VMs, so this
> > issue should exist even for hosts as long as the IRQ is managed in the
> > same way here as the virtio-blk device.
> > 
> > As Ming has mentioned in previous discussions [3], I think it would be
> > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > assigning IRQs to the cores.  Currently it's not.  What would you
> > suggest in this case?  Do you think this is a valid user scenario?
> > 
> > Thanks,
> > 
> > [1] https://lkml.org/lkml/2019/3/18/15
> > [2] https://lkml.org/lkml/2019/3/25/562
> > [3] https://lkml.org/lkml/2019/3/25/308
> 
> The following patch supposes to implementation the requirement for you,
> can you test it by passing 'isolcpus=managed_irq,X-Y'?

I really appreciate your patch!  I'll keep this version, but before I
start to test it...

> 
> With this kind of change, you can't run any IO from any isolated
> CPU core, otherwise, unpredictable error may be triggered, either oops or
> IO hang.

... I'm not sure whether this would be acceptable in a production
environment.

In our case, the IRQ comes from the virtio-blk device backing the root
disk, so I assume even the RT cores may need to use it, at least when
loading executables into RAM.  So...

> 
> Another conservative approach is to only select effective CPU from
> non-isolated cpus, however, the assigned CPUs may not be balanced among
> interrupt vectors. But it is safer, since the system still works even if
> someone submits IO from any isolated cpu core.

... this one seems more appealing, at least to me.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-19 14:32   ` Peter Xu
@ 2019-12-19 16:11     ` Ming Lei
  2019-12-19 18:09       ` Peter Xu
  2020-01-09 20:02       ` Thomas Gleixner
  0 siblings, 2 replies; 12+ messages in thread
From: Ming Lei @ 2019-12-19 16:11 UTC (permalink / raw)
  To: Peter Xu; +Cc: Thomas Gleixner, Juri Lelli, Ming Lei, Linux Kernel Mailing List

On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> > Hi Peter,
> 
> Hi, Ming,
> 
> > 
> > On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > > Hi, Thomas,
> > > 
> > > (Sorry I must have lost the discussion during an email migration, so
> > >  I'll start with a new one)
> > > 
> > > This is a continued discussion of previous one on kernel managed IRQ
> > > affinity [1].  I think at that time the conclusion is that we don't
> > > have a usage scenario to change current policy [2].  However recently
> > > I noticed that it is probably a very fundamental requirement for some
> > > real-time scenarios, even when there's no multi-queue involved.
> > > 
> > > In my test case, it was a very common realtime guest with 10 vcpus,
> > > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> > > virtio-blk device as boot disk.  With a distribution very close to
> > > latest upstream, we can observe high spikes, probably due to the IRQs.
> > > 
> > > To guarantee realtime responsiveness, we need to make sure the IRQs
> > > will be managable, say, when I run a real-time workload on vcpu9, we
> > > should be able to move all the IRQs from vcpu9 to the other vcpus
> > > (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> > > we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> > > from the virtio-blk device:
> > > 
> > >   # cat /proc/interrupts | grep -w 38
> > >   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> > >   # cat /proc/irq/38/smp_affinity
> > >   3ff
> > >   # cat /proc/irq/38/effective_affinity
> > >   200
> > > 
> > > Meanwhile, I don't think there's anything special for VMs, so this
> > > issue should exist even for hosts as long as the IRQ is managed in the
> > > same way here as the virtio-blk device.
> > > 
> > > As Ming has mentioned in previous discussions [3], I think it would be
> > > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > > assigning IRQs to the cores.  Currently it's not.  What would you
> > > suggest in this case?  Do you think this is a valid user scenario?
> > > 
> > > Thanks,
> > > 
> > > [1] https://lkml.org/lkml/2019/3/18/15
> > > [2] https://lkml.org/lkml/2019/3/25/562
> > > [3] https://lkml.org/lkml/2019/3/25/308
> > 
> > The following patch supposes to implementation the requirement for you,
> > can you test it by passing 'isolcpus=managed_irq,X-Y'?
> 
> I really appreciate your patch!  I'll keep this version, while before
> I start to test it...
> 
> > 
> > With this kind of change, you can't run any IO from any isolated
> > CPU core, otherwise, unpredictable error may be triggered, either oops or
> > IO hang.
> 
> ... I'm not sure whether this can be acceptable for a production
> environment.
> 
> In our case, the IRQ should come from virtio-blk which is the root
> disk, so I assume even the RT core could use it at least when loading
> the executable into RAM.  So...
> 
> > 
> > Another conservative approach is to only select effective CPU from
> > non-isolated cpus, however, the assigned CPUs may not be balanced among
> > interrupt vectors. But it is safer, since the system still works even if
> > someone submits IO from any isolated cpu core.
> 
> ... this one seems to be more appealing at least to me.

OK, please try the following patch:


diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 6c8512d3be88..0fbcbacd1b29 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -13,6 +13,7 @@ enum hk_flags {
 	HK_FLAG_TICK		= (1 << 4),
 	HK_FLAG_DOMAIN		= (1 << 5),
 	HK_FLAG_WQ		= (1 << 6),
+	HK_FLAG_MANAGED_IRQ	= (1 << 7),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1753486b440c..0a75a09cc4e8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -20,6 +20,7 @@
 #include <linux/sched/task.h>
 #include <uapi/linux/sched/types.h>
 #include <linux/task_work.h>
+#include <linux/sched/isolation.h>
 
 #include "internals.h"
 
@@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	const struct cpumask *housekeeping_mask =
+		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
 	int ret;
+	cpumask_var_t tmp_mask;
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	ret = chip->irq_set_affinity(data, mask, force);
+	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
+		return -EINVAL;
+
+	/*
+	 * Userspace can't change managed irq's affinity, make sure
+	 * that isolated CPU won't be selected as the effective CPU
+	 * if this irq's affinity includes both isolated CPU and
+	 * housekeeping CPU.
+	 *
+	 * This way guarantees that isolated CPU won't be interrupted
+	 * by IO submitted from housekeeping CPU.
+	 */
+	if (irqd_affinity_is_managed(data) &&
+			cpumask_intersects(mask, housekeeping_mask))
+		cpumask_and(tmp_mask, mask, housekeeping_mask);
+	else
+		cpumask_copy(tmp_mask, mask);
+
+	ret = chip->irq_set_affinity(data, tmp_mask, force);
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
@@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		ret = 0;
 	}
 
+	free_cpumask_var(tmp_mask);
+
 	return ret;
 }
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9fcb2a695a41..008d6ac2342b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "managed_irq,", 12)) {
+			str += 12;
+			flags |= HK_FLAG_MANAGED_IRQ;
+			continue;
+		}
+
 		pr_warn("isolcpus: Error, unknown flag\n");
 		return 0;
 	}

-- 
Ming


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-19 16:11     ` Ming Lei
@ 2019-12-19 18:09       ` Peter Xu
  2019-12-23 19:18         ` Peter Xu
  2020-01-09 20:02       ` Thomas Gleixner
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Xu @ 2019-12-19 18:09 UTC (permalink / raw)
  To: Ming Lei; +Cc: Thomas Gleixner, Juri Lelli, Ming Lei, Linux Kernel Mailing List

On Fri, Dec 20, 2019 at 12:11:15AM +0800, Ming Lei wrote:
> OK, please try the following patch:
> 
> 
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 6c8512d3be88..0fbcbacd1b29 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -13,6 +13,7 @@ enum hk_flags {
>  	HK_FLAG_TICK		= (1 << 4),
>  	HK_FLAG_DOMAIN		= (1 << 5),
>  	HK_FLAG_WQ		= (1 << 6),
> +	HK_FLAG_MANAGED_IRQ	= (1 << 7),
>  };
>  
>  #ifdef CONFIG_CPU_ISOLATION
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1753486b440c..0a75a09cc4e8 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/task.h>
>  #include <uapi/linux/sched/types.h>
>  #include <linux/task_work.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internals.h"
>  
> @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  {
>  	struct irq_desc *desc = irq_data_to_desc(data);
>  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> +	const struct cpumask *housekeeping_mask =
> +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
>  	int ret;
> +	cpumask_var_t tmp_mask;
>  
>  	if (!chip || !chip->irq_set_affinity)
>  		return -EINVAL;
>  
> -	ret = chip->irq_set_affinity(data, mask, force);
> +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> +		return -EINVAL;
> +
> +	/*
> +	 * Userspace can't change managed irq's affinity, make sure
> +	 * that isolated CPU won't be selected as the effective CPU
> +	 * if this irq's affinity includes both isolated CPU and
> +	 * housekeeping CPU.
> +	 *
> +	 * This way guarantees that isolated CPU won't be interrupted
> +	 * by IO submitted from housekeeping CPU.
> +	 */
> +	if (irqd_affinity_is_managed(data) &&
> +			cpumask_intersects(mask, housekeeping_mask))
> +		cpumask_and(tmp_mask, mask, housekeeping_mask);
> +	else
> +		cpumask_copy(tmp_mask, mask);
> +
> +	ret = chip->irq_set_affinity(data, tmp_mask, force);
>  	switch (ret) {
>  	case IRQ_SET_MASK_OK:
>  	case IRQ_SET_MASK_OK_DONE:
> @@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  		ret = 0;
>  	}
>  
> +	free_cpumask_var(tmp_mask);
> +
>  	return ret;
>  }
>  
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 9fcb2a695a41..008d6ac2342b 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
>  			continue;
>  		}
>  
> +		if (!strncmp(str, "managed_irq,", 12)) {
> +			str += 12;
> +			flags |= HK_FLAG_MANAGED_IRQ;
> +			continue;
> +		}
> +
>  		pr_warn("isolcpus: Error, unknown flag\n");
>  		return 0;
>  	}

Thanks for the quick patch.  I'll test it after my current round of
tests finishes and report back.  I expect this will work for us as
long as it works "functionally" :) (after all, it doesn't even need an
RT environment, because it's really about where to place some IRQs).
So IMHO the more important question is whether such a solution would
be acceptable upstream.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-19 18:09       ` Peter Xu
@ 2019-12-23 19:18         ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2019-12-23 19:18 UTC (permalink / raw)
  To: Ming Lei; +Cc: Thomas Gleixner, Juri Lelli, Ming Lei, Linux Kernel Mailing List

On Thu, Dec 19, 2019 at 01:09:17PM -0500, Peter Xu wrote:
> On Fri, Dec 20, 2019 at 12:11:15AM +0800, Ming Lei wrote:
> > OK, please try the following patch:
> > 
> > 
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 6c8512d3be88..0fbcbacd1b29 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -13,6 +13,7 @@ enum hk_flags {
> >  	HK_FLAG_TICK		= (1 << 4),
> >  	HK_FLAG_DOMAIN		= (1 << 5),
> >  	HK_FLAG_WQ		= (1 << 6),
> > +	HK_FLAG_MANAGED_IRQ	= (1 << 7),
> >  };
> >  
> >  #ifdef CONFIG_CPU_ISOLATION
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index 1753486b440c..0a75a09cc4e8 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/task.h>
> >  #include <uapi/linux/sched/types.h>
> >  #include <linux/task_work.h>
> > +#include <linux/sched/isolation.h>
> >  
> >  #include "internals.h"
> >  
> > @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >  {
> >  	struct irq_desc *desc = irq_data_to_desc(data);
> >  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +	const struct cpumask *housekeeping_mask =
> > +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
> >  	int ret;
> > +	cpumask_var_t tmp_mask;
> >  
> >  	if (!chip || !chip->irq_set_affinity)
> >  		return -EINVAL;
> >  
> > -	ret = chip->irq_set_affinity(data, mask, force);
> > +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Userspace can't change managed irq's affinity, make sure
> > +	 * that isolated CPU won't be selected as the effective CPU
> > +	 * if this irq's affinity includes both isolated CPU and
> > +	 * housekeeping CPU.
> > +	 *
> > +	 * This way guarantees that isolated CPU won't be interrupted
> > +	 * by IO submitted from housekeeping CPU.
> > +	 */
> > +	if (irqd_affinity_is_managed(data) &&
> > +			cpumask_intersects(mask, housekeeping_mask))
> > +		cpumask_and(tmp_mask, mask, housekeeping_mask);
> > +	else
> > +		cpumask_copy(tmp_mask, mask);
> > +
> > +	ret = chip->irq_set_affinity(data, tmp_mask, force);
> >  	switch (ret) {
> >  	case IRQ_SET_MASK_OK:
> >  	case IRQ_SET_MASK_OK_DONE:
> > @@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >  		ret = 0;
> >  	}
> >  
> > +	free_cpumask_var(tmp_mask);
> > +
> >  	return ret;
> >  }
> >  
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 9fcb2a695a41..008d6ac2342b 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
> >  			continue;
> >  		}
> >  
> > +		if (!strncmp(str, "managed_irq,", 12)) {
> > +			str += 12;
> > +			flags |= HK_FLAG_MANAGED_IRQ;
> > +			continue;
> > +		}
> > +
> >  		pr_warn("isolcpus: Error, unknown flag\n");
> >  		return 0;
> >  	}
> 
> Thanks for the quick patch.  I'll test after my current round of tests
> finish and update.  I'll probably believe this will work for us as
> long as it "functionally" works :) (after all it won't even need a RT
> environment because it's really about where to put some IRQs).  So
> IMHO the more important thing is whether such a solution could be
> acceptable by the upstream.

I've tested this patch, and it works for us.  "isolcpus=managed_irq,2-9"
gives me:

[root@rt-vm 32]# pwd
/proc/irq/32
[root@rt-vm 32]# cat smp_affinity
003
[root@rt-vm 32]# cat effective_affinity
001

Thomas, do you think Ming's solution could be accepted?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2019-12-19 16:11     ` Ming Lei
  2019-12-19 18:09       ` Peter Xu
@ 2020-01-09 20:02       ` Thomas Gleixner
  2020-01-10  1:28         ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-01-09 20:02 UTC (permalink / raw)
  To: Ming Lei, Peter Xu; +Cc: Juri Lelli, Ming Lei, Linux Kernel Mailing List

Ming,

Ming Lei <ming.lei@redhat.com> writes:

> On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
>> ... this one seems to be more appealing at least to me.
>
> OK, please try the following patch:
>
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 6c8512d3be88..0fbcbacd1b29 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -13,6 +13,7 @@ enum hk_flags {
>  	HK_FLAG_TICK		= (1 << 4),
>  	HK_FLAG_DOMAIN		= (1 << 5),
>  	HK_FLAG_WQ		= (1 << 6),
> +	HK_FLAG_MANAGED_IRQ	= (1 << 7),
>  };
>  
>  #ifdef CONFIG_CPU_ISOLATION
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 1753486b440c..0a75a09cc4e8 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched/task.h>
>  #include <uapi/linux/sched/types.h>
>  #include <linux/task_work.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internals.h"
>  
> @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  {
>  	struct irq_desc *desc = irq_data_to_desc(data);
>  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> +	const struct cpumask *housekeeping_mask =
> +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
>  	int ret;
> +	cpumask_var_t tmp_mask;
>  
>  	if (!chip || !chip->irq_set_affinity)
>  		return -EINVAL;
>  
> -	ret = chip->irq_set_affinity(data, mask, force);
> +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> +		return -EINVAL;

That's wrong. This code is called with interrupts disabled, so
GFP_KERNEL is wrong. And NO, we won't do a GFP_ATOMIC allocation here.

> +	/*
> +	 * Userspace can't change managed irq's affinity, make sure
> +	 * that isolated CPU won't be selected as the effective CPU
> +	 * if this irq's affinity includes both isolated CPU and
> +	 * housekeeping CPU.
> +	 *
> +	 * This way guarantees that isolated CPU won't be interrupted
> +	 * by IO submitted from housekeeping CPU.
> +	 */
> +	if (irqd_affinity_is_managed(data) &&
> +			cpumask_intersects(mask, housekeeping_mask))
> +		cpumask_and(tmp_mask, mask, housekeeping_mask);

This is duct tape engineering with absolutely no semantics. I can't even
figure out the intent of this 'managed_irq' parameter.

If the intent is to keep managed device interrupts away from isolated
cores then you really want to do that when the interrupts are spread and
not in the middle of the affinity setter code.

But first you need to define how that mask should work:

 1) Exclude CPUs from managed interrupt spreading completely

 2) Exclude CPUs only when the resulting spreading contains
    housekeeping CPUs

 3) Whatever ...

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2020-01-09 20:02       ` Thomas Gleixner
@ 2020-01-10  1:28         ` Ming Lei
  2020-01-10 19:43           ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2020-01-10  1:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Xu, Juri Lelli, Ming Lei, Linux Kernel Mailing List, linux-block

Hello Thomas,

On Thu, Jan 09, 2020 at 09:02:20PM +0100, Thomas Gleixner wrote:
> Ming,
> 
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> >> ... this one seems to be more appealing at least to me.
> >
> > OK, please try the following patch:
> >
> >
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 6c8512d3be88..0fbcbacd1b29 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -13,6 +13,7 @@ enum hk_flags {
> >  	HK_FLAG_TICK		= (1 << 4),
> >  	HK_FLAG_DOMAIN		= (1 << 5),
> >  	HK_FLAG_WQ		= (1 << 6),
> > +	HK_FLAG_MANAGED_IRQ	= (1 << 7),
> >  };
> >  
> >  #ifdef CONFIG_CPU_ISOLATION
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index 1753486b440c..0a75a09cc4e8 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/sched/task.h>
> >  #include <uapi/linux/sched/types.h>
> >  #include <linux/task_work.h>
> > +#include <linux/sched/isolation.h>
> >  
> >  #include "internals.h"
> >  
> > @@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >  {
> >  	struct irq_desc *desc = irq_data_to_desc(data);
> >  	struct irq_chip *chip = irq_data_get_irq_chip(data);
> > +	const struct cpumask *housekeeping_mask =
> > +		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
> >  	int ret;
> > +	cpumask_var_t tmp_mask;
> >  
> >  	if (!chip || !chip->irq_set_affinity)
> >  		return -EINVAL;
> >  
> > -	ret = chip->irq_set_affinity(data, mask, force);
> > +	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
> > +		return -EINVAL;
> 
> That's wrong. This code is called with interrupts disabled, so
> GFP_KERNEL is wrong. And NO, we won't do a GFP_ATOMIC allocation here.

OK, it looks like desc->lock is held here.

> 
> > +	/*
> > +	 * Userspace can't change managed irq's affinity, make sure
> > +	 * that isolated CPU won't be selected as the effective CPU
> > +	 * if this irq's affinity includes both isolated CPU and
> > +	 * housekeeping CPU.
> > +	 *
> > +	 * This way guarantees that isolated CPU won't be interrupted
> > +	 * by IO submitted from housekeeping CPU.
> > +	 */
> > +	if (irqd_affinity_is_managed(data) &&
> > +			cpumask_intersects(mask, housekeeping_mask))
> > +		cpumask_and(tmp_mask, mask, housekeeping_mask);
> 
> This is duct tape engineering with absolutely no semantics. I can't even
> figure out the intent of this 'managed_irq' parameter.

The intent is to isolate the specified CPUs from handling managed interrupts.

For non-managed interrupts, the isolation is done from userspace,
because userspace is allowed to change a non-managed interrupt's affinity.

> 
> If the intent is to keep managed device interrupts away from isolated
> cores then you really want to do that when the interrupts are spread and
> not in the middle of the affinity setter code.
> 
> But first you need to define how that mask should work:
> 
>  1) Exclude CPUs from managed interrupt spreading completely
> 
>  2) Exclude CPUs only when the resulting spreading contains
>     housekeeping CPUs
> 
>  3) Whatever ...

We can do that.  The big problem is that the RT case can't guarantee
that IO will never be submitted from an isolated CPU.  blk-mq's queue
mapping relies on the affinity set up at spreading time, so unknown
behavior (a kernel crash, an IO hang, or something else) may be
triggered if we exclude isolated CPUs from the interrupt affinity.

That is why I try to exclude isolated CPUs only from the interrupt's
effective affinity; it turns out the approach is simple and doable.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2020-01-10  1:28         ` Ming Lei
@ 2020-01-10 19:43           ` Thomas Gleixner
  2020-01-11  2:48             ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-01-10 19:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Xu, Juri Lelli, Ming Lei, Linux Kernel Mailing List, linux-block

Ming,

Ming Lei <ming.lei@redhat.com> writes:
> On Thu, Jan 09, 2020 at 09:02:20PM +0100, Thomas Gleixner wrote:
>> Ming Lei <ming.lei@redhat.com> writes:
>>
>> This is duct tape engineering with absolutely no semantics. I can't even
>> figure out the intent of this 'managed_irq' parameter.
>
> The intent is to isolate the specified CPUs from handling managed
> interrupt.

That's what I figured, but it still does not provide clear semantics
and works only for specific cases.

> We can do that. The big problem is that the RT case can't guarantee that
> IO won't be submitted from isolated CPU always. blk-mq's queue mapping
> relies on the setup affinity, so un-known behavior(kernel crash, or io
> hang, or other) may be caused if we exclude isolated CPUs from interrupt
> affinity.
>
> That is why I try to exclude isolated CPUs from interrupt effective affinity,
> turns out the approach is simple and doable.

Yes, it's doable. But it still is inconsistent behaviour. Assume the
following configuration:

  8 CPUs, with CPU0 and CPU1 assigned for housekeeping

With 8 queues the proposed change does nothing because each queue is
mapped to exactly one CPU.

With 4 queues you get the following:

 CPU0,1       queue 0
 CPU2,3       queue 1
 CPU4,5       queue 2
 CPU6,7       queue 3

No effect on the isolated CPUs either.

With 2 queues you get the following:

 CPU0,1,2,3   queue 0
 CPU4,5,6,7   queue 1

So here the isolated CPUs 2 and 3 get the isolation, but 4-7
not. That's perhaps intended, but definitely not documented.

So you really need to make your mind up and describe what the intended
effect of this is and why you think that the result is correct.

Thanks,

       tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2020-01-10 19:43           ` Thomas Gleixner
@ 2020-01-11  2:48             ` Ming Lei
  2020-01-14 13:45               ` Thomas Gleixner
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2020-01-11  2:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Xu, Juri Lelli, Ming Lei, Linux Kernel Mailing List, linux-block

Hi Thomas,

On Fri, Jan 10, 2020 at 08:43:14PM +0100, Thomas Gleixner wrote:
> Ming,
> 
> Ming Lei <ming.lei@redhat.com> writes:
> > On Thu, Jan 09, 2020 at 09:02:20PM +0100, Thomas Gleixner wrote:
> >> Ming Lei <ming.lei@redhat.com> writes:
> >>
> >> This is duct tape engineering with absolutely no semantics. I can't even
> >> figure out the intent of this 'managed_irq' parameter.
> >
> > The intent is to isolate the specified CPUs from handling managed
> > interrupt.
> 
> That's what I figured, but it still does not provide semantics and works
> just for specific cases.
> 
> > We can do that. The big problem is that the RT case can't guarantee that
> > IO won't be submitted from isolated CPU always. blk-mq's queue mapping
> > relies on the setup affinity, so un-known behavior(kernel crash, or io
> > hang, or other) may be caused if we exclude isolated CPUs from interrupt
> > affinity.
> >
> > That is why I try to exclude isolated CPUs from interrupt effective affinity,
> > turns out the approach is simple and doable.
> 
> Yes, it's doable. But it still is inconsistent behaviour. Assume the
> following configuration:
> 
>   8 CPUs CPU0,1 assigned for housekeeping
> 
> With 8 queues the proposed change does nothing because each queue is
> mapped to exactly one CPU.

That is expected behavior for this RT case, given userspace won't submit
IO from isolated CPUs.

> 
> With 4 queues you get the following:
> 
>  CPU0,1       queue 0
>  CPU2,3       queue 1
>  CPU4,5       queue 2
>  CPU6,7       queue 3
> 
> No effect on the isolated CPUs either.
> 
> With 2 queues you get the following:
> 
>  CPU0,1,2,3   queue 0
>  CPU4,5,6,7   queue 1
> 
> So here the isolated CPUs 2 and 3 get the isolation, but 4-7
> not. That's perhaps intended, but definitely not documented.

That is an intentional change: given that no IO will be submitted from
CPUs 4-7 most of the time in the RT case, it is fine to select the
effective CPU from isolated CPUs there.  As Peter mentioned, IO may
only be submitted from isolated CPUs during booting.  Once the system
is set up, no IO comes from isolated CPUs, so no interrupt is delivered
to isolated CPUs, which meets the RT requirement.

We can document this change somewhere.

> 
> So you really need to make your mind up and describe what the intended
> effect of this is and why you think that the result is correct.

In short, if there is at least one housekeeping CPU available in the
interrupt's affinity mask, we choose the effective CPU from the
housekeeping CPUs.  Otherwise, we keep the current behavior wrt.
selecting the effective CPU.

With this approach, no interrupts are delivered to isolated CPUs as
long as no IO is submitted from those CPUs.
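
To make that rule concrete, here is a minimal sketch (illustrative
only, not the exact patch; it assumes the HK_FLAG_MANAGED_IRQ
housekeeping flag introduced above, and it is essentially what the
irq_do_set_affinity() hunk in the earlier patch does, written out as a
standalone helper):

  /*
   * Pick the effective mask for a managed interrupt: prefer the
   * housekeeping CPUs when the affinity mask contains any, otherwise
   * fall back to the full affinity mask.
   */
  static void pick_effective_mask(struct cpumask *eff,
                                  const struct cpumask *affinity)
  {
          const struct cpumask *hk =
                  housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);

          if (cpumask_intersects(affinity, hk))
                  cpumask_and(eff, affinity, hk);
          else
                  cpumask_copy(eff, affinity);
  }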

Please let us know if it addresses your concerns.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2020-01-11  2:48             ` Ming Lei
@ 2020-01-14 13:45               ` Thomas Gleixner
  2020-01-14 23:38                 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Gleixner @ 2020-01-14 13:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Xu, Juri Lelli, Ming Lei, Linux Kernel Mailing List, linux-block

Ming,

Ming Lei <ming.lei@redhat.com> writes:
> On Fri, Jan 10, 2020 at 08:43:14PM +0100, Thomas Gleixner wrote:
>> Ming Lei <ming.lei@redhat.com> writes:
>> > That is why I try to exclude isolated CPUs from interrupt effective affinity,
>> > turns out the approach is simple and doable.
>> 
>> Yes, it's doable. But it still is inconsistent behaviour. Assume the
>> following configuration:
>> 
>>   8 CPUs CPU0,1 assigned for housekeeping
>> 
>> With 8 queues the proposed change does nothing because each queue is
>> mapped to exactly one CPU.
>
> That is expected behavior for this RT case, given userspace won't submit
> IO from isolated CPUs.

What is _this_ RT case? We really don't implement policy for a specific
use case. If the kernel implements a policy then it has to be generally
useful and practical.

>> With 4 queues you get the following:
>> 
>>  CPU0,1       queue 0
>>  CPU2,3       queue 1
>>  CPU4,5       queue 2
>>  CPU6,7       queue 3
>> 
>> No effect on the isolated CPUs either.
>> 
>> With 2 queues you get the following:
>> 
>>  CPU0,1,2,3   queue 0
>>  CPU4,5,6,7   queue 1
>> 
>> So here the isolated CPUs 2 and 3 get the isolation, but 4-7
>> not. That's perhaps intended, but definitely not documented.
>
> That is intentional change, given no IO will be submitted from 4-7
> most of times in RT case, so it is fine to select effective CPU from
> isolated CPUs in this case. As peter mentioned, IO may just be submitted
> from isolated CPUs during booting. Once the system is setup, no IO
> comes from isolated CPUs, then no interrupt is delivered to isolated
> CPUs, then meet RT's requirement.

Again. This is a specific use case. Is this generally applicable?

> We can document this change somewhere.

Yes, this needs to be documented very clearly with that command line
parameter.

>> So you really need to make your mind up and describe what the intended
>> effect of this is and why you think that the result is correct.
>
> In short, if there is at least one housekeeping available in the
> interrupt's affinity, we choose effective CPU from housekeeping CPUs.
> Otherwise, keep the current behavior wrt. selecting effective CPU.
>
> With this approach, no interrupts can be delivered to isolated CPUs
> if no IOs are submitted from these CPUs.
>
> Please let us know if it addresses your concerns.

Mostly. See above.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Kernel-managed IRQ affinity (cont)
  2020-01-14 13:45               ` Thomas Gleixner
@ 2020-01-14 23:38                 ` Ming Lei
  0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2020-01-14 23:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Xu, Juri Lelli, Ming Lei, Linux Kernel Mailing List, linux-block

Hi Thomas,

On Tue, Jan 14, 2020 at 02:45:00PM +0100, Thomas Gleixner wrote:
> Ming,
> 
> Ming Lei <ming.lei@redhat.com> writes:
> > On Fri, Jan 10, 2020 at 08:43:14PM +0100, Thomas Gleixner wrote:
> >> Ming Lei <ming.lei@redhat.com> writes:
> >> > That is why I try to exclude isolated CPUs from interrupt effective affinity,
> >> > turns out the approach is simple and doable.
> >> 
> >> Yes, it's doable. But it still is inconsistent behaviour. Assume the
> >> following configuration:
> >> 
> >>   8 CPUs CPU0,1 assigned for housekeeping
> >> 
> >> With 8 queues the proposed change does nothing because each queue is
> >> mapped to exactly one CPU.
> >
> > That is expected behavior for this RT case, given userspace won't submit
> > IO from isolated CPUs.
> 
> What is _this_ RT case? We really don't implement policy for a specific
> use case. If the kernel implements a policy then it has to be generally
> useful and practical.

Maybe the term 'RT case' isn't accurate; I thought isolated CPUs were
only used for realtime cases, at least that is Peter's usage, but maybe
I was wrong.

But it can be made generic for all isolated-CPU cases in which users
don't want managed interrupts to disturb the isolated CPU cores.

> 
> >> With 4 queues you get the following:
> >> 
> >>  CPU0,1       queue 0
> >>  CPU2,3       queue 1
> >>  CPU4,5       queue 2
> >>  CPU6,7       queue 3
> >> 
> >> No effect on the isolated CPUs either.
> >> 
> >> With 2 queues you get the following:
> >> 
> >>  CPU0,1,2,3   queue 0
> >>  CPU4,5,6,7   queue 1
> >> 
> >> So here the isolated CPUs 2 and 3 get the isolation, but 4-7
> >> not. That's perhaps intended, but definitely not documented.
> >
> > That is intentional change, given no IO will be submitted from 4-7
> > most of times in RT case, so it is fine to select effective CPU from
> > isolated CPUs in this case. As peter mentioned, IO may just be submitted
> > from isolated CPUs during booting. Once the system is setup, no IO
> > comes from isolated CPUs, then no interrupt is delivered to isolated
> > CPUs, then meet RT's requirement.
> 
> Again. This is a specific usecase. Is this generally applicable?

As mentioned above, it can be applied to all isolated CPUs whenever
users don't want managed interrupts to disturb these CPU cores.

> 
> > We can document this change somewhere.
> 
> Yes, this needs to be documented very clearly with that command line
> parameter.

OK, I will do that in the formal post.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-01-14 23:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 19:57 Kernel-managed IRQ affinity (cont) Peter Xu
2019-12-19  8:28 ` Ming Lei
2019-12-19 14:32   ` Peter Xu
2019-12-19 16:11     ` Ming Lei
2019-12-19 18:09       ` Peter Xu
2019-12-23 19:18         ` Peter Xu
2020-01-09 20:02       ` Thomas Gleixner
2020-01-10  1:28         ` Ming Lei
2020-01-10 19:43           ` Thomas Gleixner
2020-01-11  2:48             ` Ming Lei
2020-01-14 13:45               ` Thomas Gleixner
2020-01-14 23:38                 ` Ming Lei
