* [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
@ 2018-02-28 15:48 ` Jianchao Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Jianchao Wang @ 2018-02-28 15:48 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the adminq and ioq0 share the same irq vector. This is
unfair to both the adminq and ioq0.
 - For the adminq, its completion irq has to be bound to cpu0. It
   has only one hw queue, so it is unreasonable to do this.
 - For ioq0, when the irq fires for io completion, the adminq irq
   action on this irq vector introduces at least an uncached access
   to the adminq cqe, and it is even worse when the adminq is busy.

To improve this, allocate separate irq vectors for the adminq and
ioq0, and do not set irq affinity for the adminq one. If there is
just one irq vector, set up the adminq plus 1 ioq and let them
share it. In addition, add a new helper interface, nvme_ioq_vector,
to get the ioq vector.

V1->V2
 - add a case to handle the scenario where there is only one irq
   vector
 - add nvme_ioq_vector to map a qid to its irq vector

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73036d2..273b157 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -84,6 +84,7 @@ struct nvme_dev {
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
 	unsigned max_qid;
+	unsigned int num_vecs;
 	int q_depth;
 	u32 db_stride;
 	void __iomem *bar;
@@ -139,6 +140,17 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 	return container_of(ctrl, struct nvme_dev, ctrl);
 }
 
+static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
+		unsigned int qid)
+{
+	/*
+	 * If controller has only legacy or single-message MSI, there will
+	 * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
+	 * and let them share irq vector.
+	 */
+	return (dev->num_vecs == 1) ? 0 : qid;
+}
+
 /*
  * An NVM Express queue.  Each device has at least two (one for admin
  * commands and one for I/O commands).
@@ -1456,7 +1468,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 		nvmeq->sq_cmds_io = dev->cmb + offset;
 	}
 
-	nvmeq->cq_vector = qid - 1;
+	nvmeq->cq_vector = nvme_ioq_vector(dev, qid);
 	result = adapter_alloc_cq(dev, qid, nvmeq);
 	if (result < 0)
 		return result;
@@ -1626,9 +1638,9 @@ static int nvme_create_io_queues(struct nvme_dev *dev)
 	int ret = 0;
 
 	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
-		/* vector == qid - 1, match nvme_create_queue */
 		if (nvme_alloc_queue(dev, i, dev->q_depth,
-		     pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
+		     pci_irq_get_node(to_pci_dev(dev->dev),
+				 nvme_ioq_vector(dev, i)))) {
 			ret = -ENOMEM;
 			break;
 		}
@@ -1909,6 +1921,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	int result, nr_io_queues;
 	unsigned long size;
+	struct irq_affinity affd = {.pre_vectors = 1};
+	int ret;
 
 	nr_io_queues = num_present_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1945,12 +1959,12 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * setting up the full range we need.
 	 */
 	pci_free_irq_vectors(pdev);
-	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
-			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
-	if (nr_io_queues <= 0)
+	ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
+	if (ret <= 0)
 		return -EIO;
-	dev->max_qid = nr_io_queues;
-
+	dev->num_vecs = ret;
+	dev->max_qid = (ret > 1) ? (ret - 1) : 1;
 	/*
 	 * Should investigate if there's a performance win from allocating
 	 * more queues than interrupt vectors; it might allow the submission
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:48 ` Jianchao Wang
@ 2018-02-28 15:59   ` Andy Shevchenko
  -1 siblings, 0 replies; 41+ messages in thread
From: Andy Shevchenko @ 2018-02-28 15:59 UTC (permalink / raw)
  To: Jianchao Wang
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Linux NVMe Mailinglist, Linux Kernel Mailing List

On Wed, Feb 28, 2018 at 5:48 PM, Jianchao Wang
<jianchao.w.wang@oracle.com> wrote:
> Currently, adminq and ioq0 share the same irq vector. This is
> unfair for both amdinq and ioq0.
>  - For adminq, its completion irq has to be bound on cpu0. It
>    just has only one hw queue, it is unreasonable to do this.
>  - For ioq0, when the irq fires for io completion, the adminq irq
>    action on this irq vector will introduce an uncached access on
>    adminq cqe at least, even worse when adminq is busy.
>
> To improve this, allocate separate irq vectors for adminq and
> ioq0, and not set irq affinity for adminq one. If just one irq
> vector, setup adminq + 1 ioq and let them share it. In addition
> add new helper interface nvme_ioq_vector to get ioq vector.

> +static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
> +               unsigned int qid)
> +{
> +       /*
> +        * If controller has only legacy or single-message MSI, there will
> +        * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
> +        * and let them share irq vector.
> +        */
> +       return (dev->num_vecs == 1) ? 0 : qid;

Redundant parens.

> +}

>
>         for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
> -               /* vector == qid - 1, match nvme_create_queue */

>                 if (nvme_alloc_queue(dev, i, dev->q_depth,
> -                    pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
> +                    pci_irq_get_node(to_pci_dev(dev->dev),
> +                                nvme_ioq_vector(dev, i)))) {

Perhaps better to introduce a temporary variable to make it readable?

>                         ret = -ENOMEM;
>                         break;
>                 }
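
For example, a rough sketch of that suggestion (the temporary variable name
is arbitrary):

	for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
		unsigned int vector = nvme_ioq_vector(dev, i);

		/* allocate the queue on the node of the vector it will use */
		if (nvme_alloc_queue(dev, i, dev->q_depth,
				pci_irq_get_node(to_pci_dev(dev->dev), vector))) {
			ret = -ENOMEM;
			break;
		}
	}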

> +       ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
> +                       PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> +       if (ret <= 0)
>                 return -EIO;
> -       dev->max_qid = nr_io_queues;
> -
> +       dev->num_vecs = ret;
> +       dev->max_qid = (ret > 1) ? (ret - 1) : 1;

I do not see how ret can possibly be < 1 here.

Thus, the logic looks like this:
if ret >= 2 => return ret - 1; // Possible variants [1..ret - 1]
if ret == 1 => return 1;

So, for ret == 1 or ret == 2 we still use 1.

Is it by design?

Can it be written like

dev->max_qid = max(ret - 1, 1);

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:48 ` Jianchao Wang
@ 2018-02-28 16:47   ` Christoph Hellwig
  -1 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2018-02-28 16:47 UTC (permalink / raw)
  To: Jianchao Wang; +Cc: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

Note that we originally allocated irqs this way, and Keith changed
it a while ago for good reasons.  So I'd really like to see good
reasons for moving away from this, and some heuristics to figure
out which way to use.  E.g. if the device supports more irqs than
I/O queues, your scheme might always be fine.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 16:47   ` Christoph Hellwig
@ 2018-03-01  9:28     ` Sagi Grimberg
  -1 siblings, 0 replies; 41+ messages in thread
From: Sagi Grimberg @ 2018-03-01  9:28 UTC (permalink / raw)
  To: Christoph Hellwig, Jianchao Wang
  Cc: keith.busch, axboe, linux-nvme, linux-kernel


> Note that we originally allocates irqs this way, and Keith changed
> it a while ago for good reasons.  So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use.  E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

I still don't understand what this buys us in practice. Seems redundant
to allocate another vector without any (even marginal) difference.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-01  9:28     ` Sagi Grimberg
@ 2018-03-01 10:05       ` jianchao.wang
  -1 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-03-01 10:05 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: keith.busch, axboe, linux-kernel, linux-nvme

Hi Sagi

Thanks for your kind response.

On 03/01/2018 05:28 PM, Sagi Grimberg wrote:
> 
>> Note that we originally allocates irqs this way, and Keith changed
>> it a while ago for good reasons.  So I'd really like to see good
>> reasons for moving away from this, and some heuristics to figure
>> out which way to use.  E.g. if the device supports more irqs than
>> I/O queues your scheme might always be fine.
> 
> I still don't understand what this buys us in practice. Seems redundant
> to allocate another vector without any (even marginal) difference.
> 

When the adminq is free, the ioq0 irq completion path still has to invoke nvme_irq twice: once for itself,
and once for the adminq completion irq action.
We are trying to save every cpu cycle across the nvme host path, so why waste nvme_irq cycles here?
If we have enough vectors, we could allocate another irq vector for the adminq to avoid this.

Sincerely
Jianchao 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 16:47   ` Christoph Hellwig
@ 2018-03-01 15:03     ` Ming Lei
  -1 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-03-01 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jianchao Wang, axboe, sagi, linux-kernel, linux-nvme, keith.busch

On Wed, Feb 28, 2018 at 05:47:26PM +0100, Christoph Hellwig wrote:
> Note that we originally allocates irqs this way, and Keith changed
> it a while ago for good reasons.  So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use.  E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

If all CPUs for the 1st IRQ vector of the admin queue are offline, then I
guess NVMe can't work any more.

So it looks like a good idea to have the admin queue's IRQ vector assigned
as a non-managed IRQ.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-01 10:05       ` jianchao.wang
@ 2018-03-01 15:15         ` Keith Busch
  -1 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2018-03-01 15:15 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Sagi Grimberg, Christoph Hellwig, axboe, linux-kernel, linux-nvme

On Thu, Mar 01, 2018 at 06:05:53PM +0800, jianchao.wang wrote:
> When the adminq is free, ioq0 irq completion path has to invoke nvme_irq twice, one for itself,
> one for adminq completion irq action.

Let's be a little more careful on the terminology when referring to spec
defined features: there is no such thing as "ioq0". The IO queues start
at 1. The admin queue is the '0' index queue.

> We are trying to save every cpu cycle across the nvme host path, why we waste nvme_irq cycles here.
> If we have enough vectors, we could allocate another irq vector for adminq to avoid this.

Please understand the _overwhelming_ majority of time spent for IRQ
handling is the context switches. There's a reason you're not able to
measure a perf difference between IOQ1 and IOQ2: the number of CPU cycles
to chain a second action is negligible.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-01 15:03     ` Ming Lei
@ 2018-03-01 16:10       ` Keith Busch
  -1 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2018-03-01 16:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jianchao Wang, axboe, sagi, linux-kernel, linux-nvme

On Thu, Mar 01, 2018 at 11:03:30PM +0800, Ming Lei wrote:
> If all CPUs for the 1st IRQ vector of admin queue are offline, then I
> guess NVMe can't work any more.

Yikes, with respect to admin commands, it appears you're right if your
system allows offlining CPU0.

> So looks it is a good idea to make admin queue's IRQ vector assigned as
> non-managed IRQs.

It'd still be considered managed even if it's a 'pre_vector', though
it would get the default mask with all possible CPUs.
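
For reference, this is the shape of the allocation in the patch above: with
.pre_vectors = 1, vector 0 (used by the admin queue) is left out of the
affinity spread and keeps that default mask, while the remaining vectors get
managed affinity for the I/O queues (sketch copied from the earlier hunk):

	struct irq_affinity affd = {.pre_vectors = 1};

	/* vector 0: admin queue, default mask, not spread */
	/* vectors 1..n: I/O queues, managed affinity */
	ret = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);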

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 15:59   ` Andy Shevchenko
@ 2018-03-02  3:06     ` jianchao.wang
  -1 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-03-02  3:06 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Linux NVMe Mailinglist, Linux Kernel Mailing List

Hi Andy

Thanks for your precious time on this and the kind reminders.

On 02/28/2018 11:59 PM, Andy Shevchenko wrote:
> On Wed, Feb 28, 2018 at 5:48 PM, Jianchao Wang
> <jianchao.w.wang@oracle.com> wrote:
>> Currently, adminq and ioq0 share the same irq vector. This is
>> unfair for both amdinq and ioq0.
>>  - For adminq, its completion irq has to be bound on cpu0. It
>>    just has only one hw queue, it is unreasonable to do this.
>>  - For ioq0, when the irq fires for io completion, the adminq irq
>>    action on this irq vector will introduce an uncached access on
>>    adminq cqe at least, even worse when adminq is busy.
>>
>> To improve this, allocate separate irq vectors for adminq and
>> ioq0, and not set irq affinity for adminq one. If just one irq
>> vector, setup adminq + 1 ioq and let them share it. In addition
>> add new helper interface nvme_ioq_vector to get ioq vector.
> 
>> +static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
>> +               unsigned int qid)
>> +{
>> +       /*
>> +        * If controller has only legacy or single-message MSI, there will
>> +        * be only 1 irq vector. At the moment, we setup adminq + 1 ioq
>> +        * and let them share irq vector.
>> +        */
>> +       return (dev->num_vecs == 1) ? 0 : qid;
> 
> Redundant parens.

Yes, but the parens make it clearer.

> 
>> +}
> 
>>
>>         for (i = dev->ctrl.queue_count; i <= dev->max_qid; i++) {
>> -               /* vector == qid - 1, match nvme_create_queue */
> 
>>                 if (nvme_alloc_queue(dev, i, dev->q_depth,
>> -                    pci_irq_get_node(to_pci_dev(dev->dev), i - 1))) {
>> +                    pci_irq_get_node(to_pci_dev(dev->dev),
>> +                                nvme_ioq_vector(dev, i)))) {
> 
> Perhaps better to introduce a temporary variable to make it readable?

yes, indeed.

> 
>>                         ret = -ENOMEM;
>>                         break;
>>                 }
> 
>> +       ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
>> +                       PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
>> +       if (ret <= 0)
>>                 return -EIO;
>> -       dev->max_qid = nr_io_queues;
>> -
>> +       dev->num_vecs = ret;
>> +       dev->max_qid = (ret > 1) ? (ret - 1) : 1;
> 
> I don not see how ret can possible be < 1 here.
> 
> Thus, the logic looks like this:
> if ret >= 2 => return ret - 1; // Possible variants [1..ret - 1]
> if ret == 1 => return 1;
> 
> So, for ret == 1 or ret == 2 we still use 1.
> 
> Is it by design?
> 
> Can it be written like
> 
> dev->max_qid = max(ret - 1, 1);
> 

Yes, that looks clearer.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-01 15:15         ` Keith Busch
@ 2018-03-02  3:11           ` jianchao.wang
  -1 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-03-02  3:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Christoph Hellwig, axboe, linux-kernel, linux-nvme

Hi Keith

Thanks for your kind guidance and precious time on this.

On 03/01/2018 11:15 PM, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 06:05:53PM +0800, jianchao.wang wrote:
>> When the adminq is free, ioq0 irq completion path has to invoke nvme_irq twice, one for itself,
>> one for adminq completion irq action.
> 
> Let's be a little more careful on the terminology when referring to spec
> defined features: there is no such thing as "ioq0". The IO queues start
> at 1. The admin queue is the '0' index queue.

Yes, indeed, sorry for my bad description.

>> We are trying to save every cpu cycle across the nvme host path, why we waste nvme_irq cycles here.
>> If we have enough vectors, we could allocate another irq vector for adminq to avoid this.
> 
> Please understand the _overwhelming_ majority of time spent for IRQ
> handling is the context switches. There's a reason you're not able to
> measure a perf difference between IOQ1 and IOQ2: the number of CPU cycles
> to chain a second action is negligible.
> 

Yes, indeed

Sincerely
Jianchao

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-02-28 16:47   ` Christoph Hellwig
@ 2018-03-02  3:18     ` jianchao.wang
  -1 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-03-02  3:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, sagi, linux-kernel, linux-nvme, keith.busch

Hi Christoph

Thanks for your kind response and guidance.

On 03/01/2018 12:47 AM, Christoph Hellwig wrote:
> Note that we originally allocates irqs this way, and Keith changed
> it a while ago for good reasons.  So I'd really like to see good
> reasons for moving away from this, and some heuristics to figure
> out which way to use.  E.g. if the device supports more irqs than
> I/O queues your scheme might always be fine.

Maybe we could add logic so that, when we get enough irq vectors, we assign
a separate irq vector to the adminq, and otherwise fall back to sharing,
along the lines of the sketch below.
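
A rough, hypothetical sketch of that idea (not from the posted patch; the
condition is only illustrative):

	static inline unsigned int nvme_ioq_vector(struct nvme_dev *dev,
			unsigned int qid)
	{
		/* dedicated admin vector only if we got more vectors than I/O queues */
		if (dev->num_vecs > dev->max_qid)
			return qid;
		/* otherwise fall back to the shared mapping, adminq shares vector 0 */
		return qid - 1;
	}

With the patch as posted, num_vecs is either max_qid + 1 or 1, so this reduces
to the same behaviour; the check would only matter if the allocation can return
fewer vectors than I/O queues.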

Sincerely
Jianchao

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-01 16:10       ` Keith Busch
@ 2018-03-08  7:42         ` Christoph Hellwig
  -1 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2018-03-08  7:42 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, Christoph Hellwig, Jianchao Wang, axboe, sagi,
	linux-kernel, linux-nvme

On Thu, Mar 01, 2018 at 09:10:42AM -0700, Keith Busch wrote:
> On Thu, Mar 01, 2018 at 11:03:30PM +0800, Ming Lei wrote:
> > If all CPUs for the 1st IRQ vector of admin queue are offline, then I
> > guess NVMe can't work any more.
> 
> Yikes, with respect to admin commands, it appears you're right if your
> system allows offlining CPU0.
> 
> > So looks it is a good idea to make admin queue's IRQ vector assigned as
> > non-managed IRQs.
> 
> It'd still be considered managed even if it's a 'pre_vector', though
> it would get the default mask with all possible CPUs.

Which basically does the right thing.  So I suspect we'll need to
go with a patch like this, just with a way better changelog.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-08  7:42         ` Christoph Hellwig
@ 2018-03-09 17:24           ` Keith Busch
  -1 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2018-03-09 17:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jianchao Wang, axboe, sagi, linux-kernel, linux-nvme

On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> 
> So I suspect we'll need to go with a patch like this, just with a way
> better changelog.

I have to agree this is required for that use case. I'll run some
quick tests and propose an alternate changelog.

Longer term, the current way we're including offline but present cpus either
(a) has the driver allocate resources it can't use or (b) spreads the
ones it can use thinner than they need to be. Why don't we rerun the
irq spread under a hot cpu notifier for only online CPUs?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-09 17:24           ` Keith Busch
@ 2018-03-12  9:09             ` Ming Lei
  -1 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-03-12  9:09 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, sagi, linux-kernel, linux-nvme, axboe,
	Jianchao Wang, linux-block, Thomas Gleixner

On Fri, Mar 09, 2018 at 10:24:45AM -0700, Keith Busch wrote:
> On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> > 
> > So I suspect we'll need to go with a patch like this, just with a way
> > better changelog.
> 
> I have to agree this is required for that use case. I'll run some
> quick tests and propose an alternate changelog.
> 
> Longer term, the current way we're including offline present cpus either
> (a) has the driver allocate resources it can't use or (b) spreads the
> ones it can use thinner than they need to be. Why don't we rerun the
> irq spread under a hot cpu notifier for only online CPUs?

4b855ad371 ("blk-mq: Create hctx for each present CPU") removed the handling
of mapping changes via the hot cpu notifier. Not only was the code cleaned up,
it also fixed a very complicated queue dependency issue:

 - the loop/dm-rq queue depends on the underlying queue
 - for NVMe, the IO queue depends on the admin queue

If freezing the queue can be avoided in the CPU notifier, it should be fine
to do that; otherwise it needs to be avoided.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-02  3:06     ` jianchao.wang
@ 2018-03-12 18:59       ` Keith Busch
  -1 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2018-03-12 18:59 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Andy Shevchenko, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Linux NVMe Mailinglist, Linux Kernel Mailing List

Hi Jianchao,

The patch tests fine on all hardware I had. I'd like to queue this up
for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
suggested and a changelog aligned with Ming's insights?

Thanks,
Keith

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0
  2018-03-12 18:59       ` Keith Busch
@ 2018-03-13  1:47         ` jianchao.wang
  -1 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-03-13  1:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Andy Shevchenko, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Linux NVMe Mailinglist, Linux Kernel Mailing List

Hi Keith

Thanks for your precious time for testing and reviewing.
I will send out V3 next.

Sincerely
Jianchao

On 03/13/2018 02:59 AM, Keith Busch wrote:
> Hi Jianchao,
> 
> The patch tests fine on all hardware I had. I'd like to queue this up
> for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
> suggested and a changelog aligned with Ming's insights?
> 
> Thanks,
> Keith
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* nvme-pci: number of queues off by one
  2018-03-12  9:09             ` Ming Lei
  (?)
@ 2018-10-08  5:05             ` Prasun Ratn
  2018-10-08  5:59                 ` Dongli Zhang
  -1 siblings, 1 reply; 41+ messages in thread
From: Prasun Ratn @ 2018-10-08  5:05 UTC (permalink / raw)
  To: ming.lei
  Cc: keith.busch, hch, sagi, linux-nvme, axboe, jianchao.w.wang,
	linux-block, tglx

Hi

I have an NVMe SSD that has 8 hw queues and on older kernels I see all
8 show up. However on a recent kernel (I tried 4.18), I only see 7. Is
this a known issue?

$ uname -r
4.14.1-1.el7.elrepo.x86_64

$ ls /sys/block/nvme*n1/mq/*/cpu_list
/sys/block/nvme0n1/mq/0/cpu_list
/sys/block/nvme0n1/mq/1/cpu_list
/sys/block/nvme0n1/mq/2/cpu_list
/sys/block/nvme0n1/mq/3/cpu_list
/sys/block/nvme0n1/mq/4/cpu_list
/sys/block/nvme0n1/mq/5/cpu_list
/sys/block/nvme0n1/mq/6/cpu_list
/sys/block/nvme0n1/mq/7/cpu_list
/sys/block/nvme1n1/mq/0/cpu_list
/sys/block/nvme1n1/mq/1/cpu_list
/sys/block/nvme1n1/mq/2/cpu_list
/sys/block/nvme1n1/mq/3/cpu_list
/sys/block/nvme1n1/mq/4/cpu_list
/sys/block/nvme1n1/mq/5/cpu_list
/sys/block/nvme1n1/mq/6/cpu_list
/sys/block/nvme1n1/mq/7/cpu_list
/sys/block/nvme2n1/mq/0/cpu_list
/sys/block/nvme2n1/mq/1/cpu_list
/sys/block/nvme2n1/mq/2/cpu_list
/sys/block/nvme2n1/mq/3/cpu_list
/sys/block/nvme2n1/mq/4/cpu_list
/sys/block/nvme2n1/mq/5/cpu_list
/sys/block/nvme2n1/mq/6/cpu_list
/sys/block/nvme2n1/mq/7/cpu_list
/sys/block/nvme3n1/mq/0/cpu_list
/sys/block/nvme3n1/mq/1/cpu_list
/sys/block/nvme3n1/mq/2/cpu_list
/sys/block/nvme3n1/mq/3/cpu_list
/sys/block/nvme3n1/mq/4/cpu_list
/sys/block/nvme3n1/mq/5/cpu_list
/sys/block/nvme3n1/mq/6/cpu_list
/sys/block/nvme3n1/mq/7/cpu_list


$ uname -r
4.18.10-1.el7.elrepo.x86_64

$ ls /sys/block/nvme*n1/mq/*/cpu_list
/sys/block/nvme0n1/mq/0/cpu_list
/sys/block/nvme0n1/mq/1/cpu_list
/sys/block/nvme0n1/mq/2/cpu_list
/sys/block/nvme0n1/mq/3/cpu_list
/sys/block/nvme0n1/mq/4/cpu_list
/sys/block/nvme0n1/mq/5/cpu_list
/sys/block/nvme0n1/mq/6/cpu_list
/sys/block/nvme1n1/mq/0/cpu_list
/sys/block/nvme1n1/mq/1/cpu_list
/sys/block/nvme1n1/mq/2/cpu_list
/sys/block/nvme1n1/mq/3/cpu_list
/sys/block/nvme1n1/mq/4/cpu_list
/sys/block/nvme1n1/mq/5/cpu_list
/sys/block/nvme1n1/mq/6/cpu_list
/sys/block/nvme2n1/mq/0/cpu_list
/sys/block/nvme2n1/mq/1/cpu_list
/sys/block/nvme2n1/mq/2/cpu_list
/sys/block/nvme2n1/mq/3/cpu_list
/sys/block/nvme2n1/mq/4/cpu_list
/sys/block/nvme2n1/mq/5/cpu_list
/sys/block/nvme2n1/mq/6/cpu_list
/sys/block/nvme3n1/mq/0/cpu_list
/sys/block/nvme3n1/mq/1/cpu_list
/sys/block/nvme3n1/mq/2/cpu_list
/sys/block/nvme3n1/mq/3/cpu_list
/sys/block/nvme3n1/mq/4/cpu_list
/sys/block/nvme3n1/mq/5/cpu_list
/sys/block/nvme3n1/mq/6/cpu_list

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: nvme-pci: number of queues off by one
  2018-10-08  5:05             ` nvme-pci: number of queues off by one Prasun Ratn
@ 2018-10-08  5:59                 ` Dongli Zhang
  0 siblings, 0 replies; 41+ messages in thread
From: Dongli Zhang @ 2018-10-08  5:59 UTC (permalink / raw)
  To: Prasun Ratn, ming.lei
  Cc: keith.busch, hch, sagi, linux-nvme, axboe, jianchao.w.wang,
	linux-block, tglx

I can reproduce with qemu:

# ls /sys/block/nvme*n1/mq/*/cpu_list
/sys/block/nvme0n1/mq/0/cpu_list
/sys/block/nvme0n1/mq/1/cpu_list
/sys/block/nvme0n1/mq/2/cpu_list
/sys/block/nvme0n1/mq/3/cpu_list
/sys/block/nvme0n1/mq/4/cpu_list
/sys/block/nvme0n1/mq/5/cpu_list
/sys/block/nvme0n1/mq/6/cpu_list

Here is the qemu cmdline emulating an 8-queue nvme device while the VM has 12 cpus:

# qemu-system-x86_64 -m 4096 -smp 12 \
	-kernel /path-to-kernel/linux-4.18.10/arch/x86_64/boot/bzImage \
	-hda /path-to-img/ubuntu1804.qcow2  \
	-append "root=/dev/sda1 init=/sbin/init text" -enable-kvm \
	-net nic -net user,hostfwd=tcp::5022-:22 \
	-device nvme,drive=nvme1,serial=deadbeaf1,num_queues=8 \
	-drive file=/path-to-img/nvme.disk,if=none,id=nvme1

Dongli Zhang


On 10/08/2018 01:05 PM, Prasun Ratn wrote:
> Hi
> 
> I have an NVMe SSD that has 8 hw queues and on older kernels I see all
> 8 show up. However on a recent kernel (I tried 4.18), I only see 7. Is
> this a known issue?
> 
> $ uname -r
> 4.14.1-1.el7.elrepo.x86_64
> 
> $ ls /sys/block/nvme*n1/mq/*/cpu_list
> /sys/block/nvme0n1/mq/0/cpu_list
> /sys/block/nvme0n1/mq/1/cpu_list
> /sys/block/nvme0n1/mq/2/cpu_list
> /sys/block/nvme0n1/mq/3/cpu_list
> /sys/block/nvme0n1/mq/4/cpu_list
> /sys/block/nvme0n1/mq/5/cpu_list
> /sys/block/nvme0n1/mq/6/cpu_list
> /sys/block/nvme0n1/mq/7/cpu_list
> /sys/block/nvme1n1/mq/0/cpu_list
> /sys/block/nvme1n1/mq/1/cpu_list
> /sys/block/nvme1n1/mq/2/cpu_list
> /sys/block/nvme1n1/mq/3/cpu_list
> /sys/block/nvme1n1/mq/4/cpu_list
> /sys/block/nvme1n1/mq/5/cpu_list
> /sys/block/nvme1n1/mq/6/cpu_list
> /sys/block/nvme1n1/mq/7/cpu_list
> /sys/block/nvme2n1/mq/0/cpu_list
> /sys/block/nvme2n1/mq/1/cpu_list
> /sys/block/nvme2n1/mq/2/cpu_list
> /sys/block/nvme2n1/mq/3/cpu_list
> /sys/block/nvme2n1/mq/4/cpu_list
> /sys/block/nvme2n1/mq/5/cpu_list
> /sys/block/nvme2n1/mq/6/cpu_list
> /sys/block/nvme2n1/mq/7/cpu_list
> /sys/block/nvme3n1/mq/0/cpu_list
> /sys/block/nvme3n1/mq/1/cpu_list
> /sys/block/nvme3n1/mq/2/cpu_list
> /sys/block/nvme3n1/mq/3/cpu_list
> /sys/block/nvme3n1/mq/4/cpu_list
> /sys/block/nvme3n1/mq/5/cpu_list
> /sys/block/nvme3n1/mq/6/cpu_list
> /sys/block/nvme3n1/mq/7/cpu_list
> 
> 
> $ uname -r
> 4.18.10-1.el7.elrepo.x86_64
> 
> $ ls /sys/block/nvme*n1/mq/*/cpu_list
> /sys/block/nvme0n1/mq/0/cpu_list
> /sys/block/nvme0n1/mq/1/cpu_list
> /sys/block/nvme0n1/mq/2/cpu_list
> /sys/block/nvme0n1/mq/3/cpu_list
> /sys/block/nvme0n1/mq/4/cpu_list
> /sys/block/nvme0n1/mq/5/cpu_list
> /sys/block/nvme0n1/mq/6/cpu_list
> /sys/block/nvme1n1/mq/0/cpu_list
> /sys/block/nvme1n1/mq/1/cpu_list
> /sys/block/nvme1n1/mq/2/cpu_list
> /sys/block/nvme1n1/mq/3/cpu_list
> /sys/block/nvme1n1/mq/4/cpu_list
> /sys/block/nvme1n1/mq/5/cpu_list
> /sys/block/nvme1n1/mq/6/cpu_list
> /sys/block/nvme2n1/mq/0/cpu_list
> /sys/block/nvme2n1/mq/1/cpu_list
> /sys/block/nvme2n1/mq/2/cpu_list
> /sys/block/nvme2n1/mq/3/cpu_list
> /sys/block/nvme2n1/mq/4/cpu_list
> /sys/block/nvme2n1/mq/5/cpu_list
> /sys/block/nvme2n1/mq/6/cpu_list
> /sys/block/nvme3n1/mq/0/cpu_list
> /sys/block/nvme3n1/mq/1/cpu_list
> /sys/block/nvme3n1/mq/2/cpu_list
> /sys/block/nvme3n1/mq/3/cpu_list
> /sys/block/nvme3n1/mq/4/cpu_list
> /sys/block/nvme3n1/mq/5/cpu_list
> /sys/block/nvme3n1/mq/6/cpu_list
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: nvme-pci: number of queues off by one
  2018-10-08  5:59                 ` Dongli Zhang
@ 2018-10-08  6:58                   ` Dongli Zhang
  -1 siblings, 0 replies; 41+ messages in thread
From: Dongli Zhang @ 2018-10-08  6:58 UTC (permalink / raw)
  To: Prasun Ratn, ming.lei, jianchao.w.wang
  Cc: keith.busch, hch, sagi, linux-nvme, axboe, linux-block, tglx

I got the same result when emulating nvme with qemu: the VM has 12 cpus, while
num_queues of the nvme device is 8.

# uname -r
4.14.1
# ll /sys/block/nvme*n1/mq/*/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/0/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/1/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/2/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/3/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/4/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/5/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/6/cpu_list


# uname -r
4.18.10
# ll /sys/block/nvme*n1/mq/*/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/0/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/1/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/2/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/3/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/4/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/5/cpu_list
-r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/6/cpu_list

From the qemu source code below, when n->num_queues is 8, the handler for the
NVME_NUMBER_OF_QUEUES feature returns 0x60006.

static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
{
    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
    uint32_t dw11 = le32_to_cpu(cmd->cdw11);

    switch (dw10) {
    case NVME_VOLATILE_WRITE_CACHE:
        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
        break;
    case NVME_NUMBER_OF_QUEUES:
        trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
                                ((dw11 >> 16) & 0xFFFF) + 1,
                                n->num_queues - 1, n->num_queues - 1);
        req->cqe.result =
            cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
----> returns 0x60006 when num_queues is 8.


As a result, nr_io_queues ends up as 6 + 1 = 7 in nvme_set_queue_count() in the VM kernel, which is why only 7 io queues show up.
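
For clarity, here is a minimal standalone sketch (my own illustration, not
the actual nvme_set_queue_count() code; the helper name is made up) of how
the guest turns that 0x60006 completion result into 7 io queues:

#include <stdio.h>
#include <stdint.h>

/*
 * The Set Features (Number of Queues) result carries 0-based NSQA/NCQA
 * counts, so the usable io queue count is min(NSQA, NCQA) + 1.
 */
static unsigned int ioq_from_result(uint32_t result)
{
	uint32_t nsqa = result & 0xffff;		/* 0-based submission queues */
	uint32_t ncqa = (result >> 16) & 0xffff;	/* 0-based completion queues */

	return (nsqa < ncqa ? nsqa : ncqa) + 1;
}

int main(void)
{
	/* qemu with num_queues=8 reports (8 - 2) | ((8 - 2) << 16) = 0x60006 */
	printf("nr_io_queues = %u\n", ioq_from_result(0x60006));	/* prints 7 */
	return 0;
}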

I am not sure how to express this behaviour in NVMe terms, though.

Dongli Zhang

On 10/08/2018 01:59 PM, Dongli Zhang wrote:
> I can reproduce with qemu:
> 
> # ls /sys/block/nvme*n1/mq/*/cpu_list
> /sys/block/nvme0n1/mq/0/cpu_list
> /sys/block/nvme0n1/mq/1/cpu_list
> /sys/block/nvme0n1/mq/2/cpu_list
> /sys/block/nvme0n1/mq/3/cpu_list
> /sys/block/nvme0n1/mq/4/cpu_list
> /sys/block/nvme0n1/mq/5/cpu_list
> /sys/block/nvme0n1/mq/6/cpu_list
> 
> Here is the qemu cmdline emulating 8-queue nvme while the VM has 12 cpu:
> 
> # qemu-system-x86_64 -m 4096 -smp 12 \
> 	-kernel /path-to-kernel/linux-4.18.10/arch/x86_64/boot/bzImage \
> 	-hda /path-to-img/ubuntu1804.qcow2  \
> 	-append "root=/dev/sda1 init=/sbin/init text" -enable-kvm \
> 	-net nic -net user,hostfwd=tcp::5022-:22 \
> 	-device nvme,drive=nvme1,serial=deadbeaf1,num_queues=8 \
> 	-drive file=/path-to-img/nvme.disk,if=none,id=nvme1
> 
> Dongli Zhang
> 
> 
> On 10/08/2018 01:05 PM, Prasun Ratn wrote:
>> Hi
>>
>> I have an NVMe SSD that has 8 hw queues and on older kernels I see all
>> 8 show up. However on a recent kernel (I tried 4.18), I only see 7. Is
>> this a known issue?
>>
>> $ uname -r
>> 4.14.1-1.el7.elrepo.x86_64
>>
>> $ ls /sys/block/nvme*n1/mq/*/cpu_list
>> /sys/block/nvme0n1/mq/0/cpu_list
>> /sys/block/nvme0n1/mq/1/cpu_list
>> /sys/block/nvme0n1/mq/2/cpu_list
>> /sys/block/nvme0n1/mq/3/cpu_list
>> /sys/block/nvme0n1/mq/4/cpu_list
>> /sys/block/nvme0n1/mq/5/cpu_list
>> /sys/block/nvme0n1/mq/6/cpu_list
>> /sys/block/nvme0n1/mq/7/cpu_list
>> /sys/block/nvme1n1/mq/0/cpu_list
>> /sys/block/nvme1n1/mq/1/cpu_list
>> /sys/block/nvme1n1/mq/2/cpu_list
>> /sys/block/nvme1n1/mq/3/cpu_list
>> /sys/block/nvme1n1/mq/4/cpu_list
>> /sys/block/nvme1n1/mq/5/cpu_list
>> /sys/block/nvme1n1/mq/6/cpu_list
>> /sys/block/nvme1n1/mq/7/cpu_list
>> /sys/block/nvme2n1/mq/0/cpu_list
>> /sys/block/nvme2n1/mq/1/cpu_list
>> /sys/block/nvme2n1/mq/2/cpu_list
>> /sys/block/nvme2n1/mq/3/cpu_list
>> /sys/block/nvme2n1/mq/4/cpu_list
>> /sys/block/nvme2n1/mq/5/cpu_list
>> /sys/block/nvme2n1/mq/6/cpu_list
>> /sys/block/nvme2n1/mq/7/cpu_list
>> /sys/block/nvme3n1/mq/0/cpu_list
>> /sys/block/nvme3n1/mq/1/cpu_list
>> /sys/block/nvme3n1/mq/2/cpu_list
>> /sys/block/nvme3n1/mq/3/cpu_list
>> /sys/block/nvme3n1/mq/4/cpu_list
>> /sys/block/nvme3n1/mq/5/cpu_list
>> /sys/block/nvme3n1/mq/6/cpu_list
>> /sys/block/nvme3n1/mq/7/cpu_list
>>
>>
>> $ uname -r
>> 4.18.10-1.el7.elrepo.x86_64
>>
>> $ ls /sys/block/nvme*n1/mq/*/cpu_list
>> /sys/block/nvme0n1/mq/0/cpu_list
>> /sys/block/nvme0n1/mq/1/cpu_list
>> /sys/block/nvme0n1/mq/2/cpu_list
>> /sys/block/nvme0n1/mq/3/cpu_list
>> /sys/block/nvme0n1/mq/4/cpu_list
>> /sys/block/nvme0n1/mq/5/cpu_list
>> /sys/block/nvme0n1/mq/6/cpu_list
>> /sys/block/nvme1n1/mq/0/cpu_list
>> /sys/block/nvme1n1/mq/1/cpu_list
>> /sys/block/nvme1n1/mq/2/cpu_list
>> /sys/block/nvme1n1/mq/3/cpu_list
>> /sys/block/nvme1n1/mq/4/cpu_list
>> /sys/block/nvme1n1/mq/5/cpu_list
>> /sys/block/nvme1n1/mq/6/cpu_list
>> /sys/block/nvme2n1/mq/0/cpu_list
>> /sys/block/nvme2n1/mq/1/cpu_list
>> /sys/block/nvme2n1/mq/2/cpu_list
>> /sys/block/nvme2n1/mq/3/cpu_list
>> /sys/block/nvme2n1/mq/4/cpu_list
>> /sys/block/nvme2n1/mq/5/cpu_list
>> /sys/block/nvme2n1/mq/6/cpu_list
>> /sys/block/nvme3n1/mq/0/cpu_list
>> /sys/block/nvme3n1/mq/1/cpu_list
>> /sys/block/nvme3n1/mq/2/cpu_list
>> /sys/block/nvme3n1/mq/3/cpu_list
>> /sys/block/nvme3n1/mq/4/cpu_list
>> /sys/block/nvme3n1/mq/5/cpu_list
>> /sys/block/nvme3n1/mq/6/cpu_list
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: nvme-pci: number of queues off by one
  2018-10-08  5:59                 ` Dongli Zhang
@ 2018-10-08 10:19                   ` Ming Lei
  -1 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-10-08 10:19 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Prasun Ratn, axboe, sagi, linux-nvme, keith.busch, linux-block,
	jianchao.w.wang, tglx, hch

On Mon, Oct 08, 2018 at 01:59:05PM +0800, Dongli Zhang wrote:
> I can reproduce with qemu:
> 
> # ls /sys/block/nvme*n1/mq/*/cpu_list
> /sys/block/nvme0n1/mq/0/cpu_list
> /sys/block/nvme0n1/mq/1/cpu_list
> /sys/block/nvme0n1/mq/2/cpu_list
> /sys/block/nvme0n1/mq/3/cpu_list
> /sys/block/nvme0n1/mq/4/cpu_list
> /sys/block/nvme0n1/mq/5/cpu_list
> /sys/block/nvme0n1/mq/6/cpu_list
> 
> Here is the qemu cmdline emulating 8-queue nvme while the VM has 12 cpu:
> 
> # qemu-system-x86_64 -m 4096 -smp 12 \
> 	-kernel /path-to-kernel/linux-4.18.10/arch/x86_64/boot/bzImage \
> 	-hda /path-to-img/ubuntu1804.qcow2  \
> 	-append "root=/dev/sda1 init=/sbin/init text" -enable-kvm \
> 	-net nic -net user,hostfwd=tcp::5022-:22 \
> 	-device nvme,drive=nvme1,serial=deadbeaf1,num_queues=8 \
> 	-drive file=/path-to-img/nvme.disk,if=none,id=nvme1

This 'issue' can be reproduced on v4.14 too.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: nvme-pci: number of queues off by one
  2018-10-08  6:58                   ` Dongli Zhang
@ 2018-10-08 14:54                     ` Keith Busch
  -1 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2018-10-08 14:54 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Prasun Ratn, ming.lei, jianchao.w.wang, hch, sagi, linux-nvme,
	axboe, linux-block, tglx

On Mon, Oct 08, 2018 at 02:58:21PM +0800, Dongli Zhang wrote:
> I got the same result when emulating nvme with qemu: the VM has 12 cpu, while
> the num_queues of nvme is 8.
> 
> # uname -r
> 4.14.1
> # ll /sys/block/nvme*n1/mq/*/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/0/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/1/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/2/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/3/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/4/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/5/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/6/cpu_list
> 
> 
> # uname -r
> 4.18.10
> # ll /sys/block/nvme*n1/mq/*/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/0/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/1/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/2/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/3/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/4/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/5/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/6/cpu_list
> 
> From below qemu source code, when n->num_queues is 8, the handler of
> NVME_FEAT_NUM_QUEUES returns 0x60006.
> 
>  719 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  720 {
>  721     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
>  722     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  723
>  724     switch (dw10) {
>  725     case NVME_VOLATILE_WRITE_CACHE:
>  726         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>  727         break;
>  728     case NVME_NUMBER_OF_QUEUES:
>  729         trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
>  730                                 ((dw11 >> 16) & 0xFFFF) + 1,
>  731                                 n->num_queues - 1, n->num_queues - 1);
>  732         req->cqe.result =
>  733             cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
> ----> returns 0x60006 when num_queues is 8.
> 
> 
> Finally, nr_io_queues is set to 6+1=7 in nvme_set_queue_count() in VM kernel.
> 
> I do not know how to paraphrase this in the world of nvme.
> 
> Dongli Zhang
> 
> On 10/08/2018 01:59 PM, Dongli Zhang wrote:
> > I can reproduce with qemu:
> > 
> > # ls /sys/block/nvme*n1/mq/*/cpu_list
> > /sys/block/nvme0n1/mq/0/cpu_list
> > /sys/block/nvme0n1/mq/1/cpu_list
> > /sys/block/nvme0n1/mq/2/cpu_list
> > /sys/block/nvme0n1/mq/3/cpu_list
> > /sys/block/nvme0n1/mq/4/cpu_list
> > /sys/block/nvme0n1/mq/5/cpu_list
> > /sys/block/nvme0n1/mq/6/cpu_list
> > 
> > Here is the qemu cmdline emulating 8-queue nvme while the VM has 12 cpu:
> > 
> > # qemu-system-x86_64 -m 4096 -smp 12 \
> > 	-kernel /path-to-kernel/linux-4.18.10/arch/x86_64/boot/bzImage \
> > 	-hda /path-to-img/ubuntu1804.qcow2  \
> > 	-append "root=/dev/sda1 init=/sbin/init text" -enable-kvm \
> > 	-net nic -net user,hostfwd=tcp::5022-:22 \
> > 	-device nvme,drive=nvme1,serial=deadbeaf1,num_queues=8 \
> > 	-drive file=/path-to-img/nvme.disk,if=none,id=nvme1
> > 
> > Dongli Zhang

Qemu counts one of those queues as the admin queue.

> > On 10/08/2018 01:05 PM, Prasun Ratn wrote:
> >> Hi
> >>
> >> I have an NVMe SSD that has 8 hw queues and on older kernels I see all
> >> 8 show up. However on a recent kernel (I tried 4.18), I only see 7. Is
> >> this a known issue?

That probably means you only have 8 MSI-X vectors, one of which is
reserved for the admin queue. We used to share an IO vector with the
admin queue, but some people found that you can break your controller
that way with the Linux irq spread.
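
To make the arithmetic concrete, here is a rough sketch (my own illustration,
not the driver code; the helper name is made up) of why 8 vectors leave room
for only 7 io queues once one vector is set aside for the admin queue:

#include <stdio.h>

/*
 * With one vector reserved for the admin queue, the io queues are
 * limited to whatever vectors remain.
 */
static int usable_io_queues(int allocated_vectors, int wanted_io_queues)
{
	int io_vectors = allocated_vectors - 1;	/* one vector kept for adminq */

	return wanted_io_queues < io_vectors ? wanted_io_queues : io_vectors;
}

int main(void)
{
	/* e.g. 8 MSI-X vectors on a VM with 12 cpus asking for 12 io queues */
	printf("io queues = %d\n", usable_io_queues(8, 12));	/* prints 7 */
	return 0;
}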

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2018-10-08 22:07 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28 15:48 [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0 Jianchao Wang
2018-02-28 15:48 ` Jianchao Wang
2018-02-28 15:59 ` Andy Shevchenko
2018-02-28 15:59   ` Andy Shevchenko
2018-03-02  3:06   ` jianchao.wang
2018-03-02  3:06     ` jianchao.wang
2018-03-12 18:59     ` Keith Busch
2018-03-12 18:59       ` Keith Busch
2018-03-13  1:47       ` jianchao.wang
2018-03-13  1:47         ` jianchao.wang
2018-02-28 16:47 ` Christoph Hellwig
2018-02-28 16:47   ` Christoph Hellwig
2018-03-01  9:28   ` Sagi Grimberg
2018-03-01  9:28     ` Sagi Grimberg
2018-03-01 10:05     ` jianchao.wang
2018-03-01 10:05       ` jianchao.wang
2018-03-01 15:15       ` Keith Busch
2018-03-01 15:15         ` Keith Busch
2018-03-02  3:11         ` jianchao.wang
2018-03-02  3:11           ` jianchao.wang
2018-03-01 15:03   ` Ming Lei
2018-03-01 15:03     ` Ming Lei
2018-03-01 16:10     ` Keith Busch
2018-03-01 16:10       ` Keith Busch
2018-03-08  7:42       ` Christoph Hellwig
2018-03-08  7:42         ` Christoph Hellwig
2018-03-09 17:24         ` Keith Busch
2018-03-09 17:24           ` Keith Busch
2018-03-12  9:09           ` Ming Lei
2018-03-12  9:09             ` Ming Lei
2018-10-08  5:05             ` nvme-pci: number of queues off by one Prasun Ratn
2018-10-08  5:59               ` Dongli Zhang
2018-10-08  5:59                 ` Dongli Zhang
2018-10-08  6:58                 ` Dongli Zhang
2018-10-08  6:58                   ` Dongli Zhang
2018-10-08 14:54                   ` Keith Busch
2018-10-08 14:54                     ` Keith Busch
2018-10-08 10:19                 ` Ming Lei
2018-10-08 10:19                   ` Ming Lei
2018-03-02  3:18   ` [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0 jianchao.wang
2018-03-02  3:18     ` jianchao.wang
