linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Choose CPU based on allocated IRQs
@ 2018-10-23  1:40 Long Li
  2018-10-29 20:16 ` Michael Kelley
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Long Li @ 2018-10-23  1:40 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel; +Cc: Long Li

From: Long Li <longli@microsoft.com>

In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
assigning process.

Later, when IRQs are activated those values are not good indicators of what
CPU to choose to assign to this IRQ.

Change to choose CPU for an IRQ based on how many IRQs are already allocated
on this CPU.

Signed-off-by: Long Li <longli@microsoft.com>
---
 kernel/irq/matrix.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 6e6d467f3dec..a51689e3e7c0 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -128,7 +128,7 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
 static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
 					const struct cpumask *msk)
 {
-	unsigned int cpu, best_cpu, maxavl = 0;
+	unsigned int cpu, best_cpu, min_allocated = UINT_MAX;
 	struct cpumap *cm;
 
 	best_cpu = UINT_MAX;
@@ -136,11 +136,11 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
 	for_each_cpu(cpu, msk) {
 		cm = per_cpu_ptr(m->maps, cpu);
 
-		if (!cm->online || cm->available <= maxavl)
+		if (!cm->online || cm->allocated > min_allocated)
 			continue;
 
 		best_cpu = cpu;
-		maxavl = cm->available;
+		min_allocated = cm->allocated;
 	}
 	return best_cpu;
 }
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* RE: [PATCH] Choose CPU based on allocated IRQs
  2018-10-23  1:40 [PATCH] Choose CPU based on allocated IRQs Long Li
@ 2018-10-29 20:16 ` Michael Kelley
  2018-10-29 21:29 ` Thomas Gleixner
  2018-10-31 18:23 ` Tested-by test Hurwitz, Sherry
  2 siblings, 0 replies; 6+ messages in thread
From: Michael Kelley @ 2018-10-29 20:16 UTC (permalink / raw)
  To: Long Li, Thomas Gleixner, linux-kernel

From: Long Li <longli@microsoft.com>  Sent: Monday, October 22, 2018 6:41 PM
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.
> 
> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.
> 
> Signed-off-by: Long Li <longli@microsoft.com>

Reviewed-by:  Michael Kelley <mikelley@microsoft.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Choose CPU based on allocated IRQs
  2018-10-23  1:40 [PATCH] Choose CPU based on allocated IRQs Long Li
  2018-10-29 20:16 ` Michael Kelley
@ 2018-10-29 21:29 ` Thomas Gleixner
  2018-10-30 17:44   ` Long Li
  2018-10-31 18:23 ` Tested-by test Hurwitz, Sherry
  2 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2018-10-29 21:29 UTC (permalink / raw)
  To: Long Li; +Cc: LKML, Michael Kelley

Long,

On Tue, 23 Oct 2018, Long Li wrote:

thanks for this patch.

A trivial formal thing ahead. The subject line

   [PATCH] Choose CPU based on allocated IRQs

is lacking a proper subsystem prefix. In most cases you can figure the
prefix out by running 'git log path/to/file' which in this case will show
you that most commits touching this file use the prefix 'genirq/matrix:'.

So the proper subject would be:

   [PATCH] genirq/matrix: Choose CPU based on allocated IRQs

Subsystem prefixes are important to see where a patch belongs to right from
the subject. Without that it could belong to any random part of the kernel
and needs further inspection of the patch itself. This applies to both
email and to git shortlog listings.

> From: Long Li <longli@microsoft.com>
> 
> In irq_matrix, "available" is set when IRQs are allocated earlier in the IRQ
> assigning process.
> 
> Later, when IRQs are activated those values are not good indicators of what
> CPU to choose to assign to this IRQ.

Can you please explain why you think that available is the wrong indicator
and which problem you are trying to solve?

The WHY is really the most important part of a changelog.

> Change to choose CPU for an IRQ based on how many IRQs are already allocated
> on this CPU.

Looking deeper. The initial values are:

    available = alloc_size - (managed + systembits)
    allocated = 0

There are two distinct functionalities which modify 'available' and
'allocated' (omitting the reverse operations for simplicity):

1) managed interrupts

   reserve_managed()
	managed++;
	available--;

   alloc_managed()
        allocated++;

2) regular interrupts

   alloc()
	allocated++;
	available--;

So 'available' can be lower than 'allocated' depending on the number of
reserved managed interrupts, which have not yet been activated.

So for all regular interrupts we really want to look at the number of
'available' vectors because the reserved managed ones are already accounted
there and they need to be taken into account.

For the spreading of managed interrupts in alloc_managed() that's indeed a
different story and 'allocated' is more correct. But even that is not
completely accurate and can lead to the wrong result. The accurate solution
would be to account the managed _and_ allocated vectors separately and do
the spreading for managed interrupts based on that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH] Choose CPU based on allocated IRQs
  2018-10-29 21:29 ` Thomas Gleixner
@ 2018-10-30 17:44   ` Long Li
  0 siblings, 0 replies; 6+ messages in thread
From: Long Li @ 2018-10-30 17:44 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: LKML, Michael Kelley

> Subject: Re: [PATCH] Choose CPU based on allocated IRQs
> 
> Long,
> 
> On Tue, 23 Oct 2018, Long Li wrote:
> 
> thanks for this patch.
> 
> A trivial formal thing ahead. The subject line
> 
>    [PATCH] Choose CPU based on allocated IRQs
> 
> is lacking a proper subsystem prefix. In most cases you can figure the prefix
> out by running 'git log path/to/file' which in this case will show you that most
> commits touching this file use the prefix 'genirq/matrix:'.
> 
> So the proper subject would be:
> 
>    [PATCH] genirq/matrix: Choose CPU based on allocated IRQs
> 
> Subsystem prefixes are important to see where a patch belongs to right from
> the subject. Without that it could belong to any random part of the kernel
> and needs further inspection of the patch itself. This applies to both email
> and to git shortlog listings.

Thank you. I will send v2 to address this.

> 
> > From: Long Li <longli@microsoft.com>
> >
> > In irq_matrix, "available" is set when IRQs are allocated earlier in
> > the IRQ assigning process.
> >
> > Later, when IRQs are activated those values are not good indicators of
> > what CPU to choose to assign to this IRQ.
> 
> Can you please explain why you think that available is the wrong indicator
> and which problem you are trying to solve?
> 
> The WHY is really the most important part of a changelog.

The problem I'm seeing is that on a very large system with multiple devices of the same class (e.g. NVMe disks, using managed IRQs), they tend to use interrupts on several CPUs on the system. Under heavy load, those several CPUs are busy while other CPU are most idling. The issue is that when NVMe call irq_matrix_alloc_managed(), the assigned the CPU is always the first CPU in the cpumask, because they check for cpumap->available that will not change after managed IRQs are reserved in irq_matrix_reserve_managed (which was called from the 1st stage of IRQ setup in irq_domain_ops->alloc).

> 
> > Change to choose CPU for an IRQ based on how many IRQs are already
> > allocated on this CPU.
> 
> Looking deeper. The initial values are:
> 
>     available = alloc_size - (managed + systembits)
>     allocated = 0
> 
> There are two distinct functionalities which modify 'available' and 'allocated'
> (omitting the reverse operations for simplicity):
> 
> 1) managed interrupts
> 
>    reserve_managed()
> 	managed++;
> 	available--;
> 
>    alloc_managed()
>         allocated++;
> 
> 2) regular interrupts
> 
>    alloc()
> 	allocated++;
> 	available--;
> 
> So 'available' can be lower than 'allocated' depending on the number of
> reserved managed interrupts, which have not yet been activated.
> 
> So for all regular interrupts we really want to look at the number of 'available'
> vectors because the reserved managed ones are already accounted there
> and they need to be taken into account.

I think "reserved managed" may not always be accurate. Reserved managed IRQs may not always get activated. For an irq_data, when irq_matrix_reserve_managed is called, all the CPUs in the cpumask are reserved. Later, only one of them is activated via the call to irq_matrix_alloc_managed(). So we end up with a number of "reserved managed" that never get used.

> 
> For the spreading of managed interrupts in alloc_managed() that's indeed a
> different story and 'allocated' is more correct. But even that is not completely
> accurate and can lead to the wrong result. The accurate solution would be to
> account the managed _and_ allocated vectors separately and do the
> spreading for managed interrupts based on that.

I think checking for "allocated" is the best approach for picking which CPU to assign for a given irq_data, since we really can't rely on "managed" to decide how busy this CPU really is. Checking for "allocated" should work for both unmanaged and managed IRQs.

> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Tested-by test
  2018-10-23  1:40 [PATCH] Choose CPU based on allocated IRQs Long Li
  2018-10-29 20:16 ` Michael Kelley
  2018-10-29 21:29 ` Thomas Gleixner
@ 2018-10-31 18:23 ` Hurwitz, Sherry
  2018-11-01 10:16   ` Thomas Gleixner
  2 siblings, 1 reply; 6+ messages in thread
From: Hurwitz, Sherry @ 2018-10-31 18:23 UTC (permalink / raw)
  To: longli; +Cc: linux-kernel, longli, tglx, Hurwitz, Sherry

Tested with fio on a 2 socket AMD EPYC based server with 128 cores and 8 nvme drives.  Comparing baseline linux-next kernel results with linux-next with this patch applied IOPS increased by 40%. 

sudo fio --bs=4k --ioengine=libaio --iodepth=16 --filename=/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1 --direct=1 --runtime=60 --numjobs=160 --rw=randread --name=test --group_reporting

Tested-by: Sherry Hurwitz <sherry.hurwitz@amd.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Tested-by test
  2018-10-31 18:23 ` Tested-by test Hurwitz, Sherry
@ 2018-11-01 10:16   ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2018-11-01 10:16 UTC (permalink / raw)
  To: Hurwitz, Sherry; +Cc: longli, linux-kernel, longli

Sherry,

On Wed, 31 Oct 2018, Hurwitz, Sherry wrote:

please do not rewrite the subject when replying to a patch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-11-01 10:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-23  1:40 [PATCH] Choose CPU based on allocated IRQs Long Li
2018-10-29 20:16 ` Michael Kelley
2018-10-29 21:29 ` Thomas Gleixner
2018-10-30 17:44   ` Long Li
2018-10-31 18:23 ` Tested-by test Hurwitz, Sherry
2018-11-01 10:16   ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).