[PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

* [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
@ 2018-02-27  8:46 Jianchao Wang
  2018-02-27 15:13 ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Jianchao Wang @ 2018-02-27  8:46 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, adminq and ioq0 share the same irq vector. This is
unfair to both adminq and ioq0.
 - For adminq, its completion irq has to be bound to cpu0.
 - For ioq0, when the irq fires for an io completion, the adminq irq
   action has to be checked as well.

To improve this, allocate separate irq vectors for adminq and
ioq0, and do not set irq affinity for the adminq one.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..7f421b7 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1456,7 +1456,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 		nvmeq->sq_cmds_io = dev->cmb + offset;
 	}
 
-	nvmeq->cq_vector = qid - 1;
+	nvmeq->cq_vector = qid;
 	result = adapter_alloc_cq(dev, qid, nvmeq);
 	if (result < 0)
 		return result;
@@ -1909,6 +1909,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
+	struct irq_affinity affd = {.pre_vectors = 1};
+	int ret;
 
 	nr_io_queues = num_present_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
-	if (nr_io_queues <= 0)
+	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+	if (ret <= 0)
 		return -EIO;
-	dev->max_qid = nr_io_queues;
+	dev->max_qid = ret - 1;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
-- 
2.7.4

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-27  8:46 [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0 Jianchao Wang
@ 2018-02-27 15:13 ` Keith Busch
  2018-02-28  2:53   ` jianchao.wang
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2018-02-27 15:13 UTC
  To: Jianchao Wang; +Cc: axboe, hch, sagi, linux-nvme, linux-kernel

On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
> Currently, adminq and ioq0 share the same irq vector. This is
> unfair to both adminq and ioq0.
>  - For adminq, its completion irq has to be bound to cpu0.
>  - For ioq0, when the irq fires for an io completion, the adminq irq
>    action has to be checked as well.

This change log could use some improvements. Why is it bad if admin
interrupts affinity is with cpu0?

Are you able to measure _any_ performance difference on IO queue 1 vs IO
queue 2 that you can attribute to IO queue 1's sharing vector 0?
 
> @@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>  	 * setting up the full range we need.
>  	 */
>  	pci_free_irq_vectors(pdev);
> -	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
> -			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
> -	if (nr_io_queues <= 0)
> +	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
> +			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> +	if (ret <= 0)
>  		return -EIO;
> -	dev->max_qid = nr_io_queues;
> +	dev->max_qid = ret - 1;

So controllers that have only legacy or single-message MSI don't get any
IO queues?
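
To make the single-vector case concrete, here is a minimal userspace
sketch of the queue accounting in the quoted hunk. It is not kernel
code: the helper names (max_qid_before, max_qid_after, cq_vector_before,
cq_vector_after) are made up for illustration, and nvecs stands for the
number of vectors the allocation call returned.

```c
/*
 * Userspace model of the accounting in nvme_setup_io_queues() and
 * nvme_create_queue(). All helper names here are hypothetical.
 * nvecs is the vector count returned by the allocation call (>= 1).
 */

/* Before the patch: dev->max_qid = nr_io_queues (the vector count). */
static int max_qid_before(int nvecs)
{
	return nvecs;
}

/* After the patch: dev->max_qid = ret - 1; vector 0 is kept for adminq. */
static int max_qid_after(int nvecs)
{
	return nvecs - 1;
}

/* Before the patch: nvmeq->cq_vector = qid - 1, so qid 1 shares vector 0. */
static int cq_vector_before(int qid)
{
	return qid - 1;
}

/* After the patch: nvmeq->cq_vector = qid, so IO queues start at vector 1. */
static int cq_vector_after(int qid)
{
	return qid;
}
```

With nvecs = 1 (legacy INTx or single-message MSI), max_qid_before()
gives 1 while max_qid_after() gives 0, which is the case the question
above is about.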

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-27 15:13 ` Keith Busch
@ 2018-02-28  2:53   ` jianchao.wang
  2018-02-28 15:27     ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: jianchao.wang @ 2018-02-28  2:53 UTC
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Thanks for taking the time to review this.

On 02/27/2018 11:13 PM, Keith Busch wrote:
> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>> Currently, adminq and ioq0 share the same irq vector. This is
>> unfair to both adminq and ioq0.
>>  - For adminq, its completion irq has to be bound to cpu0.
>>  - For ioq0, when the irq fires for an io completion, the adminq irq
>>    action has to be checked as well.
> 
> This change log could use some improvements. Why is it bad if admin
> interrupts affinity is with cpu0?

Adminq interrupts should be able to fire on any CPU.
Do we have any reason to bind them to cpu0?

> 
> Are you able to measure _any_ performance difference on IO queue 1 vs IO
> queue 2 that you can attribute to IO queue 1's sharing vector 0?

Actually, I didn't measure any performance improvement on my own NVMe card.
But it may matter on some enterprise cards, especially when the media is
persistent memory. nvme_irq will be invoked twice when the ioq0 irq fires,
which introduces another unnecessary DMA access on the cq entry.

>  
>> @@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>  	 * setting up the full range we need.
>>  	 */
>>  	pci_free_irq_vectors(pdev);
>> -	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
>> -			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
>> -	if (nr_io_queues <= 0)
>> +	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
>> +			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
>> +	if (ret <= 0)
>>  		return -EIO;
>> -	dev->max_qid = nr_io_queues;
>> +	dev->max_qid = ret - 1;
> 
> So controllers that have only legacy or single-message MSI don't get any
> IO queues?
> 

Yes. At the moment, we have to share the single irq vector.

Thanks for your guidance. :)
Jianchao

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28  2:53   ` jianchao.wang
@ 2018-02-28 15:27     ` Keith Busch
  2018-02-28 15:42       ` jianchao.wang
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2018-02-28 15:27 UTC
  To: jianchao.wang; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
> On 02/27/2018 11:13 PM, Keith Busch wrote:
> > On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
> >> Currently, adminq and ioq0 share the same irq vector. This is
> >> unfair to both adminq and ioq0.
> >>  - For adminq, its completion irq has to be bound to cpu0.
> >>  - For ioq0, when the irq fires for an io completion, the adminq irq
> >>    action has to be checked as well.
> > 
> > This change log could use some improvements. Why is it bad if admin
> > interrupts affinity is with cpu0?
> 
> Adminq interrupts should be able to fire on any CPU.
> Do we have any reason to bind them to cpu0?

Your patch will have the admin vector CPU affinity mask set to
0xff..ff. The first set bit for an online CPU is the one the IRQ handler
will run on, so the admin queue will still only run on CPU 0.
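
As a rough illustration of that rule, here is a userspace sketch. It is
not the kernel's actual vector assignment code: the function name and
the 64-bit mask representation are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical model: pick the lowest-numbered CPU that is set in both
 * the IRQ's affinity mask and the online mask. Returns -1 if the two
 * masks do not intersect.
 */
static int first_online_cpu_in_mask(uint64_t affinity, uint64_t online)
{
	uint64_t m = affinity & online;
	int cpu;

	if (!m)
		return -1;

	/* Shift right until bit 0 is set; cpu counts the shifts. */
	for (cpu = 0; !(m & 1); cpu++)
		m >>= 1;

	return cpu;
}
```

With an all-ones affinity mask and CPUs 0-7 online (online = 0xff), this
returns 0, so an unrestricted mask does not spread the interrupt by
itself.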
 
> > Are you able to measure _any_ performance difference on IO queue 1 vs IO
> > queue 2 that you can attribute to IO queue 1's sharing vector 0?
> 
> Actually, I didn't measure any performance improvement on my own NVMe card.
> But it may matter on some enterprise cards, especially when the media is
> persistent memory. nvme_irq will be invoked twice when the ioq0 irq fires,
> which introduces another unnecessary DMA access on the cq entry.

A CPU reading its own memory isn't a DMA. It's just a cheap memory read.

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:27     ` Keith Busch
@ 2018-02-28 15:42       ` jianchao.wang
  2018-02-28 15:46         ` jianchao.wang
  0 siblings, 1 reply; 7+ messages in thread
From: jianchao.wang @ 2018-02-28 15:42 UTC
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

Hi Keith

Thanks for your kind response and guidance.

On 02/28/2018 11:27 PM, Keith Busch wrote:
> On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
>> On 02/27/2018 11:13 PM, Keith Busch wrote:
>>> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>>>> Currently, adminq and ioq0 share the same irq vector. This is
>>>> unfair to both adminq and ioq0.
>>>>  - For adminq, its completion irq has to be bound to cpu0.
>>>>  - For ioq0, when the irq fires for an io completion, the adminq irq
>>>>    action has to be checked as well.
>>>
>>> This change log could use some improvements. Why is it bad if admin
>>> interrupts affinity is with cpu0?
>>
>> Adminq interrupts should be able to fire on any CPU.
>> Do we have any reason to bind them to cpu0?
> 
> Your patch will have the admin vector CPU affinity mask set to
> 0xff..ff. The first set bit for an online CPU is the one the IRQ handler
> will run on, so the admin queue will still only run on CPU 0.

hmmm...yes.
When I test with only one irq vector, I get the following result:
 124:          0          0     253541          0          0          0          0          0  IR-PCI-MSI 1048576-edge      nvme0q0, nvme0q1

>  
>>> Are you able to measure _any_ performance difference on IO queue 1 vs IO
>>> queue 2 that you can attribute to IO queue 1's sharing vector 0?
>>
>> Actually, I didn't measure any performance improvement on my own NVMe card.
>> But it may matter on some enterprise cards, especially when the media is
>> persistent memory. nvme_irq will be invoked twice when the ioq0 irq fires,
>> which introduces another unnecessary DMA access on the cq entry.
> 
> A CPU reading its own memory isn't a DMA. It's just a cheap memory read.

Oh sorry, my bad. I mean it is an operation on a DMA address, which is uncached.
nvme_irq
  -> nvme_process_cq
    -> nvme_read_cqe
      -> nvme_cqe_valid

static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
		u16 phase)
{
	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
}
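
To show what that check costs when the queue has nothing new, here is a
small userspace sketch of phase-bit polling. It is only a model: the
model_* names and Q_DEPTH are hypothetical, the endianness conversion
(le16_to_cpu) is left out, and nothing here says how the real cqes
memory is mapped.

```c
#include <stdbool.h>
#include <stdint.h>

#define Q_DEPTH 4	/* hypothetical, tiny on purpose */

struct model_cqe {
	uint16_t status;	/* bit 0 is the phase tag */
};

struct model_cq {
	struct model_cqe cqes[Q_DEPTH];
	uint16_t head;
	uint16_t phase;		/* expected phase, starts at 1 */
};

/* Same comparison as nvme_cqe_valid(), on the model types. */
static bool model_cqe_valid(const struct model_cq *cq)
{
	return (cq->cqes[cq->head].status & 1) == cq->phase;
}

/* Consume every valid entry; flip the expected phase on wrap-around. */
static int model_process_cq(struct model_cq *cq)
{
	int handled = 0;

	while (model_cqe_valid(cq)) {
		handled++;
		if (++cq->head == Q_DEPTH) {
			cq->head = 0;
			cq->phase = !cq->phase;
		}
	}
	return handled;
}
```

On a queue with no new completions, model_process_cq() reads one status
field at head and returns 0. That single read is the extra work in
question when two queues share one vector.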

Sincerely
Jianchao

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:42       ` jianchao.wang
@ 2018-02-28 15:46         ` jianchao.wang
  2018-02-28 15:53           ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: jianchao.wang @ 2018-02-28 15:46 UTC
  To: Keith Busch; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi



On 02/28/2018 11:42 PM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kind response and guidance.
> 
> On 02/28/2018 11:27 PM, Keith Busch wrote:
>> On Wed, Feb 28, 2018 at 10:53:31AM +0800, jianchao.wang wrote:
>>> On 02/27/2018 11:13 PM, Keith Busch wrote:
>>>> On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
>>>>> Currently, adminq and ioq0 share the same irq vector. This is
>>>>> unfair to both adminq and ioq0.
>>>>>  - For adminq, its completion irq has to be bound to cpu0.
>>>>>  - For ioq0, when the irq fires for an io completion, the adminq irq
>>>>>    action has to be checked as well.
>>>>
>>>> This change log could use some improvements. Why is it bad if admin
>>>> interrupts affinity is with cpu0?
>>>
>>> Adminq interrupts should be able to fire on any CPU.
>>> Do we have any reason to bind them to cpu0?
>>
>> Your patch will have the admin vector CPU affinity mask set to
>> 0xff..ff. The first set bit for an online CPU is the one the IRQ handler
>> will run on, so the admin queue will still only run on CPU 0.
> 
> hmmm...yes.
> When I test with only one irq vector, I get the following result:
>  124:          0          0     253541          0          0          0          0          0  IR-PCI-MSI 1048576-edge      nvme0q0, nvme0q1
> 

irqbalance may migrate the adminq irq away from cpu0.

>>  
>>>> Are you able to measure _any_ performance difference on IO queue 1 vs IO
>>>> queue 2 that you can attribute to IO queue 1's sharing vector 0?
>>>
>>> Actually, I didn't measure any performance improvement on my own NVMe card.
>>> But it may matter on some enterprise cards, especially when the media is
>>> persistent memory. nvme_irq will be invoked twice when the ioq0 irq fires,
>>> which introduces another unnecessary DMA access on the cq entry.
>>
>> A CPU reading its own memory isn't a DMA. It's just a cheap memory read.
> 
> Oh sorry, my bad. I mean it is an operation on a DMA address, which is uncached.
> nvme_irq
>   -> nvme_process_cq
>     -> nvme_read_cqe
>       -> nvme_cqe_valid
> 
> static inline bool nvme_cqe_valid(struct nvme_queue *nvmeq, u16 head,
> 		u16 phase)
> {
> 	return (le16_to_cpu(nvmeq->cqes[head].status) & 1) == phase;
> }
> 
> Sincerely
> Jianchao
> 

* Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:46         ` jianchao.wang
@ 2018-02-28 15:53           ` Keith Busch
  0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2018-02-28 15:53 UTC
  To: jianchao.wang; +Cc: axboe, linux-kernel, hch, linux-nvme, sagi

On Wed, Feb 28, 2018 at 11:46:20PM +0800, jianchao.wang wrote:
> 
> irqbalance may migrate the adminq irq away from cpu0.

No, irqbalance can't touch managed IRQs. See irq_can_set_affinity_usr().
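
Roughly, the userspace-facing check refuses the write when the IRQ's
affinity is managed by the kernel. The sketch below is a userspace model
of that idea under stated assumptions, not the kernel source: struct
model_irq and model_can_set_affinity_usr() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-IRQ state consulted by the check. */
struct model_irq {
	bool can_set_affinity;	/* affinity of this irq can be changed at all */
	bool affinity_managed;	/* affinity is owned by the kernel */
};

/*
 * Model of the rule: userspace (e.g. irqbalance writing to
 * /proc/irq/N/smp_affinity) may change affinity only if the IRQ
 * supports it and its affinity is not managed.
 */
static bool model_can_set_affinity_usr(const struct model_irq *irq)
{
	return irq->can_set_affinity && !irq->affinity_managed;
}
```

For a managed vector the model returns false regardless of the other
flag, which is why irqbalance cannot move it.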
