* [PATCH 0/3] NVMe: Initialization error handling fixups
@ 2015-06-08 16:08 Keith Busch
  2015-06-08 16:08 ` [PATCH 1/3] NVMe: Fix device cleanup on initialization failure Keith Busch
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Keith Busch @ 2015-06-08 16:08 UTC (permalink / raw)


The asynchronous device probe created some interesting deadlocks and
accesses to unallocated resources if the device failed during probe. This
series checks which resources were actually allocated and owned by the
driver before cleaning them up, returns the expected errors on
driver-aborted commands, and fixes the deadlocks between the probe and
reset workers.

Keith Busch (3):
  NVMe: Fix device cleanup on initialization failure
  NVMe: Don't use fake status on cancelled command
  NVMe: Unify controller probe and resume

 drivers/block/nvme-core.c |   75 +++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 26 deletions(-)

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] NVMe: Fix device cleanup on initialization failure
  2015-06-08 16:08 [PATCH 0/3] NVMe: Initialization error handling fixups Keith Busch
@ 2015-06-08 16:08 ` Keith Busch
  2015-06-08 16:08 ` [PATCH 2/3] NVMe: Don't use fake status on cancelled command Keith Busch
  2015-06-08 16:08 ` [PATCH 3/3] NVMe: Unify controller probe and resume Keith Busch
  2 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2015-06-08 16:08 UTC (permalink / raw)


Don't release block queue and tagging resources if the driver never
got them in the first place. This can happen if the controller fails to
become ready, if memory wasn't available to allocate a tagset or admin
queue, or if the resources were released as part of error recovery.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/block/nvme-core.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 2072ae8..8513321 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -193,6 +193,13 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
+static void nvme_admin_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	struct nvme_queue *nvmeq = hctx->driver_data;
+
+	nvmeq->tags = NULL;
+}
+
 static int nvme_admin_init_request(void *data, struct request *req,
 				unsigned int hctx_idx, unsigned int rq_idx,
 				unsigned int numa_node)
@@ -1605,6 +1612,7 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_admin_init_hctx,
+	.exit_hctx      = nvme_admin_exit_hctx,
 	.init_request	= nvme_admin_init_request,
 	.timeout	= nvme_timeout,
 };
@@ -1647,6 +1655,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		}
 		if (!blk_get_queue(dev->admin_q)) {
 			nvme_dev_remove_admin(dev);
+			dev->admin_q = NULL;
 			return -ENODEV;
 		}
 	} else
@@ -2733,8 +2742,10 @@ static void nvme_free_dev(struct kref *kref)
 	put_device(dev->device);
 	nvme_free_namespaces(dev);
 	nvme_release_instance(dev);
-	blk_mq_free_tag_set(&dev->tagset);
-	blk_put_queue(dev->admin_q);
+	if (dev->tagset.tags)
+		blk_mq_free_tag_set(&dev->tagset);
+	if (dev->admin_q)
+		blk_put_queue(dev->admin_q);
 	kfree(dev->queues);
 	kfree(dev->entry);
 	kfree(dev);
@@ -2865,6 +2876,9 @@ static int nvme_dev_start(struct nvme_dev *dev)
 
  free_tags:
 	nvme_dev_remove_admin(dev);
+	blk_put_queue(dev->admin_q);
+	dev->admin_q = NULL;
+	dev->queues[0]->tags = NULL;
  disable:
 	nvme_disable_queue(dev, 0);
 	nvme_dev_list_remove(dev);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] NVMe: Don't use fake status on cancelled command
  2015-06-08 16:08 [PATCH 0/3] NVMe: Initialization error handling fixups Keith Busch
  2015-06-08 16:08 ` [PATCH 1/3] NVMe: Fix device cleanup on initialization failure Keith Busch
@ 2015-06-08 16:08 ` Keith Busch
  2015-06-11 10:40   ` Christoph Hellwig
  2015-06-08 16:08 ` [PATCH 3/3] NVMe: Unify controller probe and resume Keith Busch
  2 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2015-06-08 16:08 UTC (permalink / raw)


Synchronous commands do different things for timed-out commands
vs. controller-returned errors.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/block/nvme-core.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 8513321..52742d4 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -613,7 +613,10 @@ static void req_completion(struct nvme_queue *nvmeq, void *ctx,
 			return;
 		}
 		if (req->cmd_type == REQ_TYPE_DRV_PRIV) {
-			req->errors = status;
+			if (cmd_rq->ctx == CMD_CTX_CANCELLED)
+				req->errors = -EINTR;
+			else
+				req->errors = status;
 		} else {
 			req->errors = nvme_error_status(status);
 		}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] NVMe: Unify controller probe and resume
  2015-06-08 16:08 [PATCH 0/3] NVMe: Initialization error handling fixups Keith Busch
  2015-06-08 16:08 ` [PATCH 1/3] NVMe: Fix device cleanup on initialization failure Keith Busch
  2015-06-08 16:08 ` [PATCH 2/3] NVMe: Don't use fake status on cancelled command Keith Busch
@ 2015-06-08 16:08 ` Keith Busch
  2 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2015-06-08 16:08 UTC (permalink / raw)


This unifies probe and resume so they both may be scheduled in the same
way. This is necessary for error handling that may occur during device
initialization, since the task that cleans up the device wouldn't be able
to run if it is blocked on device initialization.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/block/nvme-core.c |   52 +++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 52742d4..326abff 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -2360,19 +2360,20 @@ static int nvme_dev_add(struct nvme_dev *dev)
 	}
 	kfree(ctrl);
 
-	dev->tagset.ops = &nvme_mq_ops;
-	dev->tagset.nr_hw_queues = dev->online_queues - 1;
-	dev->tagset.timeout = NVME_IO_TIMEOUT;
-	dev->tagset.numa_node = dev_to_node(dev->dev);
-	dev->tagset.queue_depth =
+	if (!dev->tagset.tags) {
+		dev->tagset.ops = &nvme_mq_ops;
+		dev->tagset.nr_hw_queues = dev->online_queues - 1;
+		dev->tagset.timeout = NVME_IO_TIMEOUT;
+		dev->tagset.numa_node = dev_to_node(dev->dev);
+		dev->tagset.queue_depth =
 				min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
-	dev->tagset.cmd_size = nvme_cmd_size(dev);
-	dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
-	dev->tagset.driver_data = dev;
-
-	if (blk_mq_alloc_tag_set(&dev->tagset))
-		return 0;
+		dev->tagset.cmd_size = nvme_cmd_size(dev);
+		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+		dev->tagset.driver_data = dev;
 
+		if (blk_mq_alloc_tag_set(&dev->tagset))
+			return 0;
+	}
 	schedule_work(&dev->scan_work);
 	return 0;
 }
@@ -2923,7 +2924,7 @@ static int nvme_dev_resume(struct nvme_dev *dev)
 		spin_unlock(&dev_list_lock);
 	} else {
 		nvme_unfreeze_queues(dev);
-		schedule_work(&dev->scan_work);
+		nvme_dev_add(dev);
 		nvme_set_irq_hints(dev);
 	}
 	return 0;
@@ -2931,8 +2932,17 @@ static int nvme_dev_resume(struct nvme_dev *dev)
 
 static void nvme_dev_reset(struct nvme_dev *dev)
 {
+	bool in_probe = work_busy(&dev->probe_work);
+
 	nvme_dev_shutdown(dev);
-	if (nvme_dev_resume(dev)) {
+
+	/* Synchronize with device probe so that work will see failure status
+	 * and exit gracefully without trying to schedule another reset */
+	flush_work(&dev->probe_work);
+
+	/* Fail this device if reset occurred during probe to avoid
+	 * infinite initialization loops. */
+	if (in_probe) {
 		dev_warn(dev->dev, "Device failed to resume\n");
 		kref_get(&dev->kref);
 		if (IS_ERR(kthread_run(nvme_remove_dead_ctrl, dev, "nvme%d",
@@ -2941,7 +2951,11 @@ static void nvme_dev_reset(struct nvme_dev *dev)
 				"Failed to start controller remove task\n");
 			kref_put(&dev->kref, nvme_free_dev);
 		}
+		return;
 	}
+	/* Schedule device resume asynchronously so the reset work is available
+	 * to cleanup errors that may occur during reinitialization */
+	schedule_work(&dev->probe_work);
 }
 
 static void nvme_reset_failed_dev(struct work_struct *ws)
@@ -2973,6 +2987,7 @@ static int nvme_reset(struct nvme_dev *dev)
 
 	if (!ret) {
 		flush_work(&dev->reset_work);
+		flush_work(&dev->probe_work);
 		return 0;
 	}
 
@@ -3069,18 +3084,9 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 static void nvme_async_probe(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, probe_work);
-	int result;
-
-	result = nvme_dev_start(dev);
-	if (result)
-		goto reset;
 
-	if (dev->online_queues > 1)
-		result = nvme_dev_add(dev);
-	if (result)
+	if (nvme_dev_resume(dev))
 		goto reset;
-
-	nvme_set_irq_hints(dev);
 	return;
  reset:
 	spin_lock(&dev_list_lock);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] NVMe: Don't use fake status on cancelled command
  2015-06-08 16:08 ` [PATCH 2/3] NVMe: Don't use fake status on cancelled command Keith Busch
@ 2015-06-11 10:40   ` Christoph Hellwig
  2015-06-11 14:15     ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2015-06-11 10:40 UTC (permalink / raw)


On Mon, Jun 08, 2015@10:08:14AM -0600, Keith Busch wrote:
> Synchronized commands do different things for timed out commands
> vs. controller returned errors.

Can you explain this a bit more?  Can't find anything that checks for
the error value.  Let's hope the EINTR never leaks to userspace either
as it has a well defined and very different meaning there.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] NVMe: Don't use fake status on cancelled command
  2015-06-11 10:40   ` Christoph Hellwig
@ 2015-06-11 14:15     ` Keith Busch
  2015-06-11 15:23       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2015-06-11 14:15 UTC (permalink / raw)


On Thu, 11 Jun 2015, Christoph Hellwig wrote:
> On Mon, Jun 08, 2015@10:08:14AM -0600, Keith Busch wrote:
>> Synchronized commands do different things for timed out commands
>> vs. controller returned errors.
>
> Can you explain this a bit more?  Can't find anything that checks for
> the error value.  Let's hope the EINTR never leaks to userspace either
> as it has a well defined and very different meaning there.

Yep, the first command issued is set features for the number of queues. If
this returns a valid non-successful NVMe status, the driver believes the
controller is operational and can be managed, though without IO. If it
returns a status < 0, the controller is believed to be non-operational
and initialization will fail.

I used -EINTR to restore what used to be returned from the driver's
original implementation, but it doesn't have to be this error.
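
For reference, a minimal sketch (not the exact upstream code) of how the
number-of-queues setup can tell the two cases apart, assuming a
nvme_set_features()-style helper that passes req->errors back as its
return value:

	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
								&result);
	if (status < 0)
		return status;	/* driver aborted/cancelled the command */
	if (status > 0)
		return 0;	/* controller alive but refused: manage it, no IO queues */

	return min(result & 0xffff, result >> 16) + 1;

A negative value means the command never completed on the controller (like
the -EINTR above for a cancelled request), while a positive value is a
genuine NVMe status code the controller returned.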

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] NVMe: Don't use fake status on cancelled command
  2015-06-11 14:15     ` Keith Busch
@ 2015-06-11 15:23       ` Christoph Hellwig
       [not found]         ` <55935989.70809@huawei.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2015-06-11 15:23 UTC (permalink / raw)


On Thu, Jun 11, 2015@02:15:31PM +0000, Keith Busch wrote:
> Yep, the first command issued is set features for number of queues. If
> this returns a valid non-successful NVMe status, the driver believes
> the controller is operational and can be managed, although no IO. If it
> returns a status < 0, the controller is believed to be non-operational
> and initialization will fail.
> 
> I used -EINTR to restore what used to be returned from the driver's
> original implementation, but it doesn't have to be this error.

Seems like a few other places use -EINTR the same, so let's keep it
for now and maybe replace all of them with something more sensible
in another step.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
       [not found]         ` <55935989.70809@huawei.com>
@ 2015-07-01 16:17           ` Keith Busch
  2015-07-01 16:45             ` James R. Bergsten
                               ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Keith Busch @ 2015-07-01 16:17 UTC (permalink / raw)


On Tue, 30 Jun 2015, dingxiang wrote:
> Hi,All
> We are now using NVMe to develop a share I/O model,the topology is
>
>   |------|        |------|
>   |Host A|        |Host B|
>   |______|        |______|
>       \              /
>        \            /
>         \ |------| /
>          \| nvme |/
>           |______|


I think I'm missing part of the picture here. Could you explain how
you managed to get two hosts to talk to a single nvme controller. More
specifically, how are they able to safely share the admin queue and the
pci-e function's nvme registers?


> We assign one queue for every host,
> here are the details of host A and B:
>
> Host A:
>  QID     :2
>  MSIX irq:117
>  cq prp1 :0x31253530000
>  sq prp1 :0x3124af30000
>
> Host B:
>  QID     :3
>  MSIX irq:118
>  cq prp1 :0x35252470000
>  sq prp1 :0x3524d820000
>
> Then we run test script in both hosts,the script is :
>  insmod nvme.ko
>  sleep 2
>  rmmod nvme
>  sleep 2
>
> When the script runs after a period of time,Host B will crash in function "nvme_process_cq",
> and Host A will print "I/O Buffer error" messages.
> We found when host B crash,the QID Host B processed is QID 2,and the command_id
> in struct "nvme_completion" is not the value allocate in Host B, but same as Host A ,
> the MSIX and prp value of host B are not change.
> My doubt is why Host B can receive Host A's nvmeq info? In my opinion,the queues of Host A and B
> are independent, should not interfere with each other.
> Thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-01 16:17           ` Question about NVMe share I/O Keith Busch
@ 2015-07-01 16:45             ` James R. Bergsten
  2015-07-02  7:11             ` dingxiang
  2015-07-02 12:42             ` Yijing Wang
  2 siblings, 0 replies; 14+ messages in thread
From: James R. Bergsten @ 2015-07-01 16:45 UTC (permalink / raw)


First guess is both hosts trying to share the admin queue?

-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf
Of Keith Busch
Sent: Wednesday, July 1, 2015 9:18 AM
To: dingxiang
Cc: Keith Busch; linux-nvme at lists.infradead.org
Subject: Re: Question about NVMe share I/O

On Tue, 30 Jun 2015, dingxiang wrote:
> Hi,All
> We are now using NVMe to develop a share I/O model,the topology is
>
>   |------|        |------|
>   |Host A|        |Host B|
>   |______|        |______|
>       \              /
>        \            /
>         \ |------| /
>          \| nvme |/
>           |______|


I think I'm missing part of the picture here. Could you explain how you
managed to get two hosts to talk to a single nvme controller. More
specifically, how are they able to safely share the admin queue and the
pci-e function's nvme registers?


> We assign one queue for every host,
> here are the details of host A and B:
>
> Host A:
>  QID     :2
>  MSIX irq:117
>  cq prp1 :0x31253530000
>  sq prp1 :0x3124af30000
>
> Host B:
>  QID     :3
>  MSIX irq:118
>  cq prp1 :0x35252470000
>  sq prp1 :0x3524d820000
>
> Then we run test script in both hosts,the script is :
>  insmod nvme.ko
>  sleep 2
>  rmmod nvme
>  sleep 2
>
> When the script runs after a period of time,Host B will crash in 
> function "nvme_process_cq", and Host A will print "I/O Buffer error"
messages.
> We found when host B crash,the QID Host B processed is QID 2,and the 
> command_id in struct "nvme_completion" is not the value allocate in 
> Host B, but same as Host A , the MSIX and prp value of host B are not
change.
> My doubt is why Host B can receive Host A's nvmeq info? In my 
> opinion,the queues of Host A and B are independent, should not interfere
with each other.
> Thanks!

_______________________________________________
Linux-nvme mailing list
Linux-nvme at lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-01 16:17           ` Question about NVMe share I/O Keith Busch
  2015-07-01 16:45             ` James R. Bergsten
@ 2015-07-02  7:11             ` dingxiang
  2015-07-02 12:42             ` Yijing Wang
  2 siblings, 0 replies; 14+ messages in thread
From: dingxiang @ 2015-07-02  7:11 UTC (permalink / raw)




On 2015/7/2 0:17, Keith Busch wrote:
> On Tue, 30 Jun 2015, dingxiang wrote:
>> Hi,All
>> We are now using NVMe to develop a share I/O model,the topology is
>>
>>   |------|        |------|
>>   |Host A|        |Host B|
>>   |______|        |______|
>>       \              /
>>        \            /
>>         \ |------| /
>>          \| nvme |/
>>           |______|
> 
> 
> I think I'm missing part of the picture here. Could you explain how
> you managed to get two hosts to talk to a single nvme controller. More
> specifically, how are they able to safely share the admin queue and the
> pci-e function's nvme registers?
> 
> 
We create the admin queue and an I/O queue (QID 1) in the nvme server before Host A&B start.
Host A&B send admin commands to the nvme controller through a PLX chip, so the nvme controller
can handle admin commands from Host A&B, and there is no error in the admin commands' PCIe
messages; Host A&B can create I/O queues and perform other admin commands normally.
So Host A&B can share the admin queue and the PCIe function's nvme registers safely.

>> We assign one queue for every host,
>> here are the details of host A and B:
>>
>> Host A:
>>  QID     :2
>>  MSIX irq:117
>>  cq prp1 :0x31253530000
>>  sq prp1 :0x3124af30000
>>
>> Host B:
>>  QID     :3
>>  MSIX irq:118
>>  cq prp1 :0x35252470000
>>  sq prp1 :0x3524d820000
>>
>> Then we run test script in both hosts,the script is :
>>  insmod nvme.ko
>>  sleep 2
>>  rmmod nvme
>>  sleep 2
>>
>> When the script runs after a period of time,Host B will crash in function "nvme_process_cq",
>> and Host A will print "I/O Buffer error" messages.
>> We found when host B crash,the QID Host B processed is QID 2,and the command_id
>> in struct "nvme_completion" is not the value allocate in Host B, but same as Host A ,
>> the MSIX and prp value of host B are not change.
>> My doubt is why Host B can receive Host A's nvmeq info? In my opinion,the queues of Host A and B
>> are independent, should not interfere with each other.
>> Thanks!
> 
> .
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-01 16:17           ` Question about NVMe share I/O Keith Busch
  2015-07-01 16:45             ` James R. Bergsten
  2015-07-02  7:11             ` dingxiang
@ 2015-07-02 12:42             ` Yijing Wang
  2015-07-02 14:42               ` Keith Busch
  2 siblings, 1 reply; 14+ messages in thread
From: Yijing Wang @ 2015-07-02 12:42 UTC (permalink / raw)


On 2015/7/2 0:17, Keith Busch wrote:
> On Tue, 30 Jun 2015, dingxiang wrote:
>> Hi,All
>> We are now using NVMe to develop a share I/O model,the topology is
>>
>>   |------|        |------|
>>   |Host A|        |Host B|
>>   |______|        |______|
>>       \              /
>>        \            /
>>         \ |------| /
>>          \| nvme |/
>>           |______|
> 
> 
> I think I'm missing part of the picture here. Could you explain how
> you managed to get two hosts to talk to a single nvme controller. More
> specifically, how are they able to safely share the admin queue and the
> pci-e function's nvme registers?

Hi Keith, it's not a traditional topology; the physical NVMe device is located in
a manager OS which is independent of the other hosts.  Every Host connects to the manager OS
by some PCIe interconnect topology (something like an NTB bridge). All hosts share the admin
queue, which is created in the manager OS, so if a host wants to deliver an admin command to
the nvme controller, it first sends the admin command to the manager OS, and the manager OS
posts the admin command to the nvme controller on its behalf.  Thanks to the PCIe interconnect
fabric, every Host can exclusively occupy several NVMe IO queues which bypass the manager OS;
the DMA packets can be routed to the correct Host by the PCIe interconnect fabric.

In our test case, we have two hosts, A and B, and a manager OS. The manager OS occupies the admin
queue and the first IO queue (id = 1), Host A occupies IO queues 2 and 3, and Host B occupies IO
queues 4 and 5. Every IO queue has its own completion queue.

Most of the time, the Hosts and the NVMe device work fine; we can read/write the same nvme device
from different Hosts. But if we run a test which insmods and rmmods the nvme driver (we reworked)
in both hosts, a system crash happens, and the root cause is that Host B receives a completion
which does not belong to it; we found it belongs to Host A, because the submission queue id in
the completion is 2. It's strange; according to the NVMe spec, I think every IO queue should be
independent.

So is there some possibility that an NVMe completion would be delivered to another IO queue's
completion queue?

Thanks!
Yijing.


> 
> 
>> We assign one queue for every host,
>> here are the details of host A and B:
>>
>> Host A:
>>  QID     :2
>>  MSIX irq:117
>>  cq prp1 :0x31253530000
>>  sq prp1 :0x3124af30000
>>
>> Host B:
>>  QID     :3
>>  MSIX irq:118
>>  cq prp1 :0x35252470000
>>  sq prp1 :0x3524d820000
>>
>> Then we run test script in both hosts,the script is :
>>  insmod nvme.ko
>>  sleep 2
>>  rmmod nvme
>>  sleep 2
>>
>> When the script runs after a period of time,Host B will crash in function "nvme_process_cq",
>> and Host A will print "I/O Buffer error" messages.
>> We found when host B crash,the QID Host B processed is QID 2,and the command_id
>> in struct "nvme_completion" is not the value allocate in Host B, but same as Host A ,
>> the MSIX and prp value of host B are not change.
>> My doubt is why Host B can receive Host A's nvmeq info? In my opinion,the queues of Host A and B
>> are independent, should not interfere with each other.
>> Thanks!
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 
> 


-- 
Thanks!
Yijing

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-02 12:42             ` Yijing Wang
@ 2015-07-02 14:42               ` Keith Busch
  2015-07-03  1:24                 ` Yijing Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2015-07-02 14:42 UTC (permalink / raw)


On Thu, 2 Jul 2015, Yijing Wang wrote:
> Most of the time, the Host and NVMe work fine, we could read/write the same
> nvme by different Host, but if we do test which insmod and rmmod nvme
> driver(we reworked) in both hosts, a system crash would happen, Host A,
> because submit queue id in completion is 2.

Could you share the source to your "reworked" driver?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-02 14:42               ` Keith Busch
@ 2015-07-03  1:24                 ` Yijing Wang
  2015-07-08  8:49                   ` dingxiang
  0 siblings, 1 reply; 14+ messages in thread
From: Yijing Wang @ 2015-07-03  1:24 UTC (permalink / raw)


On 2015/7/2 22:42, Keith Busch wrote:
> On Thu, 2 Jul 2015, Yijing Wang wrote:
>> Most of the time, the Host and NVMe work fine, we could read/write the same
>> nvme by different Host, but if we do test which insmod and rmmod nvme
>> driver(we reworked) in both hosts, a system crash would happen, Host A,
>> because submit queue id in completion is 2.
> 
> Could you share the source to your "reworked" driver?
> 
> 

It has a lot of changes; here is a diff of the new reworked driver against the default nvme driver in Linux 3.10.

The main changes focus on the following:

1. Private DMA alloc functions: we use them to allocate DMA resources, and the addresses are global across the hosts,
   so when the NVMe controller transmits DMA packets, the PCIe interconnect fabric can route the DMA to the
   correct host by its DMA address;

2. Private MSI-X enable functions: the default MSI-X setup programs the local MSI-X address, so we need to update it
   to a global DMA address, which just adds a global address offset;

3. Use the otherwise unused NVMe BAR4 as the communication channel to map a host's nvme admin queue to the manager OS, so we can pass
   the host admin command to the manager OS, and the manager OS is responsible for delivering the admin command to the physical
   nvme controller;

4. Request nvme IO queues from the manager OS, which returns the free IO queue ids; the host then requests creation of the allocated IO queues.



[yijing at localhost linux-3.10-new]$ diff drivers/block/nvme-host.c drivers/block/nvme-core.c
44d43
< #include <linux/msi.h>
61d59
< static dma_addr_t plx_dma_addr_offset = 0;
63,64d60
< static void nvme_post_admin_cmd(struct nvme_dev *dev,
< 		struct nvme_command *c);
90,163d85
< static void * nvme_dma_alloc_coherent(struct device *dev,
< 		size_t size, dma_addr_t *dma_handle, gfp_t gfp)
< {
< 	void *mem;
<
< 	mem = dma_alloc_coherent(dev, size, dma_handle, gfp);
< 	/* Add dma address offset for nvme device in the host side */
< 	*dma_handle += plx_dma_addr_offset;
< 	return mem;
< }
<
< static void nvme_dma_free_coherent(struct device *dev,
< 		size_t size, void *vaddr, dma_addr_t bus)
< {
< 	dma_free_coherent(dev, size, vaddr, bus - plx_dma_addr_offset);
< }
<
< static int nvme_dma_map_sg(struct device *dev, struct scatterlist *sg,
< 		int nents, enum dma_data_direction dir)
< {
< 	int result, i;
< 	struct scatterlist *s;
< 	
< 	result = dma_map_sg(dev, sg, nents, dir);
< 	for_each_sg(sg, s, nents, i)
< 		s->dma_address += plx_dma_addr_offset;
< 	
< 	return result;
< }
<
< static void nvme_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
< 		int nents, enum dma_data_direction dir)
< {
< 	int i;
< 	struct scatterlist *s;
< 	
< 	for_each_sg(sg, s, nents, i)
< 		s->dma_address -= plx_dma_addr_offset;
<
< 	dma_unmap_sg(dev, sg, nents, dir);
< }
<
< /* NVMe private MSI interfaces */
< static int nvme_enable_msix(struct nvme_dev *dev, int nvec)
< {
< 	int ret;
< 	void __iomem *base;
< 	struct msi_desc *entry;
<
< 	ret = pci_enable_msix(dev->pci_dev, dev->entry, nvec);
< 	if (!ret) {
< 		list_for_each_entry(entry, &dev->pci_dev->msi_list, list) {
< 			base = entry->mask_base +
< 				entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
<
< 			entry->msg.address_lo += plx_dma_addr_offset & 0xffffffff;
< 			entry->msg.address_hi += plx_dma_addr_offset >> 32;
<
< 			mask_msix_entry(dev->pci_dev, entry->msi_attrib.entry_nr,
< 					entry->mask_base + entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE
< 					+ PCI_MSIX_ENTRY_VECTOR_CTRL, 1);
< 			/* Flush the updated MSI address */
< 			writel(entry->msg.address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
< 			writel(entry->msg.address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
<
< 			mask_msix_entry(dev->pci_dev, entry->msi_attrib.entry_nr,
< 					entry->mask_base + entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE
< 					+ PCI_MSIX_ENTRY_VECTOR_CTRL, 0);
< 		}	
< 	}
<
< 	return ret;
< }
<
289d210
< 	pr_info("%s: nvmeq %p, free cmdid %d\n", __func__, nvmeq, cmdid);
308c229
< 	return dev->queues[get_cpu()];
---
> 	return dev->queues[get_cpu() + 1];
398c319
< 		nvme_dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
---
> 		dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
630c551
< 	if (nvme_dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
---
> 	if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
699d619
< 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
734d653
< 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
832,837d750
< 		if (!fn) {
< 			pr_err("%s: nvmeq %p, result %d, sq_head %d,"
< 					"sq_id %d, command id %d, status %d\n",
< 					__func__, nvmeq, cqe.result, cqe.sq_head,
< 					cqe.sq_id, cqe.command_id, cqe.status);
< 		}
915d827
< 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
941c853
< 	u32 val;
---
> 	int status;
948,950c860,862
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
---
> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
> 	if (status)
> 		return -EIO;
956a869
> 	int status;
959d871
< 	u32 val;
968,972c880,883
< 	pr_debug("%s: cq qid %d, prp1 0x%llx, vector %d\n",
< 			__func__, qid, nvmeq->cq_dma_addr, nvmeq->cq_vector);
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
---
>
> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
> 	if (status)
> 		return -EIO;
978a890
> 	int status;
981d892
< 	u32 val;
991,995c902,904
< 	pr_debug("%s: sq qid %d, prp1 0x%llx\n",
< 			__func__, qid, nvmeq->sq_dma_addr);
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
---
> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
> 	if (status)
> 		return -EIO;
1009,1083d917
< static void nvme_post_admin_cmd(struct nvme_dev *dev,
< 		struct nvme_command *c)
< {
< 	int i;
< 	u32 *addr = (u32 *)c;
<
< 	/* nvme admin command is always 64 bytes */
< 	for (i = 0; i < 16; i++) {
< 		writel(*addr, dev->bar4 + 8 + i * 4);
< 		addr++;
< 	}
< 	/* Tag it's a nvme admin command */
< 	writel(NVME_ADMIN_CMD, dev->bar4 + 4);
< }
<
< static int nvme_recv_data_back(struct nvme_dev *dev,
< 		void *mem, size_t size)
< {
< 	u32 *addr = (u32 *)mem;
< 	int count = 30, i;
< 	u32 val;
<
< 	while (count--) {
< 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
< 				NVME_RETURN_READY) {
< 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
<
< 			val = readl(dev->bar4 + NVME_RETURN_OFFSET + 4);
< 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET + 4);
< 			if (val) {
< 				/* admin process fail */
< 				dev_warn(&dev->pci_dev->dev,
< 						"admin command fail\n");
< 				return val;
< 			}
<
< 			for (i = 0; i < size; i += 4) {
< 				*addr = readl(dev->bar4 + NVME_RETURN_OFFSET + 8 + i);
< 				addr++;
< 			}
< 			break;
< 		}
< 		msleep(10);
< 	}
< 	
< 	if (!count) {
< 		dev_warn(&dev->pci_dev->dev,
< 				"recv admin command data back timeout\n");
< 		return -1;
< 	}
<
< 	return 0;
< }
<
< int nvme_identify_plx(struct nvme_dev *dev, unsigned nsid, unsigned cns,
< 							void *mem)
< {
< 	struct nvme_command c;
< 	int val;
<
< 	memset(&c, 0, sizeof(c));
< 	c.identify.opcode = nvme_admin_identify;
< 	c.identify.nsid = cpu_to_le32(nsid);
< 	/* prp1 is not necessary, it will be replaced
< 	 * with MCPU dma address in PLX MGR
< 	 */
< 	c.identify.prp1 = cpu_to_le64(0x12345678);
< 	c.identify.cns = cpu_to_le32(cns);
< 	
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
< 	return nvme_recv_data_back(dev, mem, 4096);
< }
<
1112,1133d945
< int nvme_get_features_plx(struct nvme_dev *dev, unsigned fid, unsigned nsid,
< 					void *mem, u32 *result)
< {
< 	struct nvme_command c;
< 	int val;
<
< 	memset(&c, 0, sizeof(c));
< 	c.features.opcode = nvme_admin_get_features;
< 	c.features.nsid = cpu_to_le32(nsid);
< 	/* prp1 is not necessary, it will be replaced
< 	 * with MCPU dma address in PLX MGR, so
< 	 * 0x12345678 is meaningless here.
< 	 */
< 	c.features.prp1 = cpu_to_le64(0x12345678);
< 	c.features.fid = cpu_to_le32(fid);
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
<
< 	return nvme_recv_data_back(dev, mem, 4096);
< }
<
1138d949
< 	u32 val;
1146,1150c957
< 	nvme_post_admin_cmd(dev, &c);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
<
< 	return nvme_recv_data_back(dev, result, 4);
---
> 	return nvme_submit_admin_cmd(dev, &c, result);
1184c991
< 	nvme_dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
---
> 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
1186c993
< 	nvme_dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
---
> 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
1194c1001
< 	int vector = dev->entry[nvmeq->cq_vector - dev->first + 1].vector;
---
> 	int vector = dev->entry[nvmeq->cq_vector].vector;
1207,1209c1014,1018
< 	/* Hosts don't have admin queue,the IO queues' index are from 0 */
< 	adapter_delete_sq(dev, qid + dev->first);
< 	adapter_delete_cq(dev, qid + dev->first);
---
> 	/* Don't tell the adapter to delete the admin queue */
> 	if (qid) {
> 		adapter_delete_sq(dev, qid);
> 		adapter_delete_cq(dev, qid);
> 	}
1224c1033
< 	nvmeq->cqes = nvme_dma_alloc_coherent(dmadev, CQ_SIZE(depth),
---
> 	nvmeq->cqes = dma_alloc_coherent(dmadev, CQ_SIZE(depth),
1230c1039
< 	nvmeq->sq_cmds = nvme_dma_alloc_coherent(dmadev, SQ_SIZE(depth),
---
> 	nvmeq->sq_cmds = dma_alloc_coherent(dmadev, SQ_SIZE(depth),
1250c1059
< 	nvme_dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
---
> 	dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
1265,1266c1074,1075
< 	return request_irq(dev->entry[nvmeq->cq_vector - dev->first + 1].vector,
< 			nvme_irq, IRQF_DISABLED | IRQF_SHARED, name, nvmeq);
---
> 	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
> 				IRQF_DISABLED | IRQF_SHARED, name, nvmeq);
1297c1106
< 	nvme_dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
---
> 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
1299c1108
< 	nvme_dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
---
> 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
1305d1113
< #if 0
1346c1154,1200
< #endif
---
>
> static int nvme_configure_admin_queue(struct nvme_dev *dev)
> {
> 	int result;
> 	u32 aqa;
> 	u64 cap = readq(&dev->bar->cap);
> 	struct nvme_queue *nvmeq;
>
> 	dev->dbs = ((void __iomem *)dev->bar) + 4096;
> 	dev->db_stride = NVME_CAP_STRIDE(cap);
>
> 	result = nvme_disable_ctrl(dev, cap);
> 	if (result < 0)
> 		return result;
>
> 	nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
> 	if (!nvmeq)
> 		return -ENOMEM;
>
> 	aqa = nvmeq->q_depth - 1;
> 	aqa |= aqa << 16;
>
> 	dev->ctrl_config = NVME_CC_ENABLE | NVME_CC_CSS_NVM;
> 	dev->ctrl_config |= (PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT;
> 	dev->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
> 	dev->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
>
> 	writel(aqa, &dev->bar->aqa);
> 	writeq(nvmeq->sq_dma_addr, &dev->bar->asq);
> 	writeq(nvmeq->cq_dma_addr, &dev->bar->acq);
> 	writel(dev->ctrl_config, &dev->bar->cc);
>
> 	result = nvme_enable_ctrl(dev, cap);
> 	if (result)
> 		goto free_q;
>
> 	result = queue_request_irq(dev, nvmeq, "nvme admin");
> 	if (result)
> 		goto free_q;
>
> 	dev->queues[0] = nvmeq;
> 	return result;
>
>  free_q:
> 	nvme_free_queue_mem(nvmeq);
> 	return result;
> }
1388c1242
< 	nents = nvme_dma_map_sg(&dev->pci_dev->dev, sg, count,
---
> 	nents = dma_map_sg(&dev->pci_dev->dev, sg, count,
1410c1264
< 	nvme_dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
---
> 	dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
1470c1324
< 		meta_mem = nvme_dma_alloc_coherent(&dev->pci_dev->dev, meta_len,
---
> 		meta_mem = dma_alloc_coherent(&dev->pci_dev->dev, meta_len,
1522c1376
< 		nvme_dma_free_coherent(&dev->pci_dev->dev, meta_len, meta_mem,
---
> 		dma_free_coherent(&dev->pci_dev->dev, meta_len, meta_mem,
1771c1625
< static int get_queue_info(struct nvme_dev *dev, int *start, int count)
---
> static int set_queue_count(struct nvme_dev *dev, int count)
1773,1788c1627,1629
< 	u32 result = 0, val;
< 	int c = 10;
<
< 	writel(NVME_IOQ_INFO, dev->bar4 + 4);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
<
< 	while (c--) {
< 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
< 				NVME_RETURN_READY) {
< 			result = readl(dev->bar4 + NVME_RETURN_OFFSET + 8);
< 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
< 			break;
< 		}
< 		msleep(10);
< 	}
---
> 	int status;
> 	u32 result;
> 	u32 q_count = (count - 1) | ((count - 1) << 16);
1790,1795c1631,1635
< 	/*
< 	 * MCPU would save the start IO queue number in high 16 bits
< 	 * the IO queue number is saved in low 16 bits
< 	 */
< 	*start = result >> 16;
< 	return (result & 0xffff);
---
> 	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
> 								&result);
> 	if (status)
> 		return -EIO;
> 	return min(result & 0xffff, result >> 16) + 1;
1801,1802c1641
< 	int result, first, cpu, i, nr_io_queues;
< 	int db_bar_size, q_depth;
---
> 	int result, cpu, i, nr_io_queues, db_bar_size, q_depth, q_count;
1805,1806c1644
< 	/* "first" is the first io queue id allocated */
< 	result = get_queue_info(dev, &first, nr_io_queues);
---
> 	result = set_queue_count(dev, nr_io_queues);
1809,1810d1646
< 	if (result == 0 || first == 0)
< 		return -EPERM;
1813,1815c1649,1654
< 	
< 	dev->first = first;
< 	db_bar_size = 4096 + ((first + nr_io_queues) << (dev->db_stride + 3));
---
>
> 	q_count = nr_io_queues;
> 	/* Deregister the admin queue's interrupt */
> 	free_irq(dev->entry[0].vector, dev->queues[0]);
>
> 	db_bar_size = 4096 + ((nr_io_queues + 1) << (dev->db_stride + 3));
1823,1827d1661
< 	/*
< 	 * Admin queue and first io queue share the MSI-X irq
< 	 * in MCPU, so if io queue id is x, its related vector
< 	 * should be x-1.
< 	 */
1829c1663
< 		dev->entry[i].entry = i + first - 1;
---
> 		dev->entry[i].entry = i;
1831c1665
< 		result = nvme_enable_msix(dev, nr_io_queues);
---
> 		result = pci_enable_msix(pdev, dev->entry, nr_io_queues);
1842a1677,1697
> 	if (nr_io_queues == 0) {
> 		nr_io_queues = q_count;
> 		for (;;) {
> 			result = pci_enable_msi_block(pdev, nr_io_queues);
> 			if (result == 0) {
> 				for (i = 0; i < nr_io_queues; i++)
> 					dev->entry[i].vector = i + pdev->irq;
> 				break;
> 			} else if (result > 0) {
> 				nr_io_queues = result;
> 				continue;
> 			} else {
> 				nr_io_queues = 1;
> 				break;
> 			}
> 		}
> 	}
>
> 	result = queue_request_irq(dev, dev->queues[0], "nvme admin");
> 	/* XXX: handle failure here */
>
1852,1855c1707,1709
< 		dev->queues[i] = nvme_create_queue(dev, i + first, q_depth,
< 				i + first -1);
< 		if (IS_ERR(dev->queues[i]))
< 			return PTR_ERR(dev->queues[i]);
---
> 		dev->queues[i + 1] = nvme_create_queue(dev, i + 1, q_depth, i);
> 		if (IS_ERR(dev->queues[i + 1]))
> 			return PTR_ERR(dev->queues[i + 1]);
1860,1861c1714,1715
< 		int target = i % rounddown_pow_of_two(dev->queue_count);
< 		dev->queues[i] = dev->queues[target];
---
> 		int target = i % rounddown_pow_of_two(dev->queue_count - 1);
> 		dev->queues[i + 1] = dev->queues[target + 1];
1887a1742
> 	dma_addr_t dma_addr;
1894c1749,1750
< 	mem = kzalloc(8192, GFP_KERNEL);
---
> 	mem = dma_alloc_coherent(&dev->pci_dev->dev, 8192, &dma_addr,
> 								GFP_KERNEL);
1898c1754
< 	res = nvme_identify_plx(dev, 0, 1, mem);
---
> 	res = nvme_identify(dev, 0, 1, dma_addr);
1918c1774
< 		res = nvme_identify_plx(dev, i, 0, mem);
---
> 		res = nvme_identify(dev, i, 0, dma_addr);
1925,1926c1781,1782
< 		res = nvme_get_features_plx(dev, NVME_FEAT_LBA_RANGE, i,
< 							mem + 4096, NULL);
---
> 		res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
> 							dma_addr + 4096, NULL);
1939c1795
< 	kfree(mem);
---
> 	dma_free_coherent(&dev->pci_dev->dev, 8192, mem, dma_addr);
2049,2074d1904
< static void nvme_update_dma_addr_offset(struct nvme_dev *dev)
< {
< 	int val, count = 10;
< 	u64	result = 0;
<
< 	if (plx_dma_addr_offset)
< 		return;
<
< 	writel(NVME_DMA_ADDR_OFFSET, dev->bar4 + 4);
< 	writel(NVME_DATA_VALID, dev->bar4);
< 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
< 	while (count--) {
< 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
< 				NVME_RETURN_READY) {
<
< 			result = readq(dev->bar4 + NVME_RETURN_OFFSET + 8);
< 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
< 			break;
< 		}
< 		msleep(10);
< 	}
<
< 	dev_info(&dev->pci_dev->dev, "PLX dma addr offset: 0x%llx\n", result);
< 	plx_dma_addr_offset = result;
< }
<
2098,2112c1928
< 	struct pci_dev *tmp_dev = NULL;
< 	u64 cap;
< 	int flag = 0;	
<
< 	pdev->nvme = 1;
< 	for_each_pci_dev(tmp_dev){
< 		if(tmp_dev->device == 0x1009){
< 			flag = 1;
< 			break;
< 		}
< 	}
< 	if(flag)
< 		return 0;
< 	if(pdev->bus->self->device != 0x9797)
< 		return 0;	
---
>
2120c1936
< 	dev->queues = kcalloc(num_possible_cpus(), sizeof(void *),
---
> 	dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
2146a1963,1964
> 	dev->entry[0].vector = pdev->irq;
>
2157,2159c1975,1976
< 	dev->bar4 = ioremap(pci_resource_start(pdev, 4), 16 *1024);
< 	if (!dev->bar4) {
< 		result = -ENOMEM;
---
> 	result = nvme_configure_admin_queue(dev);
> 	if (result)
2161,2167c1978
< 	}
<
< 	nvme_update_dma_addr_offset(dev);
<
< 	cap = readq(&dev->bar->cap);
< 	dev->dbs = ((void __iomem *)dev->bar) + 4096;
< 	dev->db_stride = NVME_CAP_STRIDE(cap);
---
> 	dev->queue_count++;
2199d2009
< 	iounmap(dev->bar4);
[yijing at localhost linux-3.10-new]$


-- 
Thanks!
Yijing

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Question about NVMe share I/O
  2015-07-03  1:24                 ` Yijing Wang
@ 2015-07-08  8:49                   ` dingxiang
  0 siblings, 0 replies; 14+ messages in thread
From: dingxiang @ 2015-07-08  8:49 UTC (permalink / raw)


Hi Keith,
   There is a simple model that can duplicate this issue; here is the diff, and these changes are based on the 3.10 kernel driver:

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ce79a59..9791459 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -80,6 +80,7 @@ struct nvme_queue {
        u16 sq_tail;
        u16 cq_head;
        u16 cq_phase;
+       u16 qid;
        unsigned long cmdid_data[];
 };

@@ -748,6 +749,8 @@ static irqreturn_t nvme_process_cq(struct nvme_queue *nvmeq)
                }

                ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
+                if(!fn)
+                       printk("cmdid:%d sq_id:%d ,nvmeqid:%d\n",cqe.command_id,cqe.sq_id,nvmeq->qid);
                fn(nvmeq->dev, ctx, &cqe);
        }

@@ -1095,7 +1098,8 @@ static struct nvme_queue *nvme_create_queue(struct nvme_dev *dev, int qid,
        result = queue_request_irq(dev, nvmeq, "nvme");
        if (result < 0)
                goto release_sq;
-
+
+       nvmeq->qid = qid;
        return nvmeq;

  release_sq:
@@ -1703,13 +1707,14 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)

        q_depth = min_t(int, NVME_CAP_MQES(readq(&dev->bar->cap)) + 1,
                                                                NVME_Q_DEPTH);
-       for (i = 0; i < nr_io_queues; i++) {
+       for (i = 1; i < nr_io_queues; i++) {
                dev->queues[i + 1] = nvme_create_queue(dev, i + 1, q_depth, i);
                if (IS_ERR(dev->queues[i + 1]))
                        return PTR_ERR(dev->queues[i + 1]);
                dev->queue_count++;
        }

+       dev->queues[1]=dev->queues[2];
        for (; i < num_possible_cpus(); i++) {
                int target = i % rounddown_pow_of_two(dev->queue_count - 1);
                dev->queues[i + 1] = dev->queues[target + 1];

In some cases, we need to reserve the first several queues, so we will not create queues starting from qid 1.
Here is the crash log; from the log, we can see that the qid 7 in the cqe is different from qid 8 in the nvmeq,
and this causes the system crash.

[  150.618085] cmdid:0 sq_id:7 ,nvmeqid:8
[  150.621821] BUG: unable to handle kernel NULL pointer dereference at   (null)
[  150.629628] IP: [<          (null)>]           (null)
[  150.634660] PGD 0
[  150.636668] Oops: 0010 [#1] SMP
[  150.639895] Modules linked in: nvme(OF+) nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep xt_conntrack bluetooth rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw mperf coretemp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel iTCO_wdt iTCO_vendor_support microcode serio_raw tg3 pcspkr ptp mei_me ioatdma ses pps_core mei lpc_ich enclosure i2c_i801 mfd_core i2c_core shpchp wmi dca nfsd auth_rpcgss nfs_acl lockd sunrpc isci libsas megaraid_sas scsi_transport_sas [last unloaded: nvme]
[  150.710416] CPU: 6 PID: 0 Comm: swapper/6 Tainted: GF          O 3.10.68+ #1
[  150.717420] Hardware name: Huawei Technologies Co., Ltd. RH2285H V2-24S/BC11SRSF1, BIOS RMIBV399 12/15/2014
[  150.727098] task: ffff880c3cd88000 ti: ffff880c3cd82000 task.ti: ffff880c3cd82000
[  150.734531] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[  150.741978] RSP: 0018:ffff88184ee03e50  EFLAGS: 00010082
[  150.747253] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000000083f
[  150.754342] RDX: ffff88184ee03e68 RSI: 0000000000000000 RDI: ffff8818337f9000
[  150.761431] RBP: ffff88184ee03ea0 R08: 0000000000000000 R09: 00000000000005d7
[  150.768521] R10: 0000000000000000 R11: ffff88184ee03b96 R12: 0000000000000001
[  150.775610] R13: 0000000000000001 R14: ffff881836518000 R15: 0000000000000010
[  150.782700] FS:  0000000000000000(0000) GS:ffff88184ee00000(0000) knlGS:0000000000000000
[  150.790738] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  150.796446] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000001407e0
[  150.803536] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  150.810625] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  150.817714] Stack:
[  150.819711]  ffffffffa03da49c 0000000000000000 0000000000000000 0000000000000000
[  150.827098]  0001000000070001 ffff881836518000 000000000000007b 000000000000007b
[  150.834485]  0000000000000000 ffffffff81c921b8 ffff88184ee03eb8 ffffffffa03da5ae
[  150.841872] Call Trace:
[  150.844302]  <IRQ>
[  150.846214]  [<ffffffffa03da49c>] ? nvme_process_cq+0x9c/0x190 [nvme]
[  150.852812]  [<ffffffffa03da5ae>] nvme_irq+0x1e/0x30 [nvme]
[  150.858354]  [<ffffffff810e8fde>] handle_irq_event_percpu+0x3e/0x1e0
[  150.864668]  [<ffffffff810e91b6>] handle_irq_event+0x36/0x60
[  150.870293]  [<ffffffff810ebb4f>] handle_edge_irq+0x6f/0x120
[  150.875919]  [<ffffffff810133df>] handle_irq+0xbf/0x150
[  150.881114]  [<ffffffff8163d06a>] ? atomic_notifier_call_chain+0x1a/0x20
[  150.887773]  [<ffffffff8164331d>] do_IRQ+0x4d/0xc0
[  150.892535]  [<ffffffff8163922d>] common_interrupt+0x6d/0x6d
[  150.898157]  <EOI>
[  150.900068]  [<ffffffff814dead2>] ? cpuidle_enter_state+0x52/0xc0
[  150.906319]  [<ffffffff814dec09>] cpuidle_idle_call+0xc9/0x210
[  150.912117]  [<ffffffff8101a21e>] arch_cpu_idle+0xe/0x30
[  150.917395]  [<ffffffff810aa455>] cpu_startup_entry+0xe5/0x280
[  150.923193]  [<ffffffff81628276>] start_secondary+0x253/0x255
[  150.928901] Code:  Bad RIP value.
[  150.932215] RIP  [<          (null)>]           (null)
[  150.937328]  RSP <ffff88184ee03e50>
[  150.940792] CR2: 0000000000000000
[  150.944085] ---[ end trace 302912b189a4225c ]---
[  150.996202] Kernel panic - not syncing: Fatal exception in interrupt
[  152.061661] Shutting down cpus with NMI
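
To keep the host from jumping through the NULL fn pointer while we debug this,
one option is a guard in nvme_process_cq(); this is only a sketch against the
3.10 loop, and it uses the qid field added in the debug patch above:

		ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
		if (unlikely(!fn)) {
			/* Completion for a cmdid this queue never issued:
			 * log it and drop it instead of calling a NULL fn. */
			dev_warn(&nvmeq->dev->pci_dev->dev,
				"bogus completion: cmdid %d sq_id %d on nvmeq qid %d\n",
				cqe.command_id, cqe.sq_id, nvmeq->qid);
			continue;
		}
		fn(nvmeq->dev, ctx, &cqe);

Of course this only avoids the crash; it doesn't explain why the completion
for Host A's queue shows up on Host B's completion queue in the first place.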





> On 2015/7/2 22:42, Keith Busch wrote:
>> On Thu, 2 Jul 2015, Yijing Wang wrote:
>>> Most of the time, the Host and NVMe work fine, we could read/write the same
>>> nvme by different Host, but if we do test which insmod and rmmod nvme
>>> driver(we reworked) in both hosts, a system crash would happen, Host A,
>>> because submit queue id in completion is 2.
>>
>> Could you share the source to your "reworked" driver?
>>
>>
> 
> It has a lot of changes; here is a diff of the new reworked driver against the default nvme driver in Linux 3.10.
> 
> The main changes focus on the following:
> 
> 1. Private DMA alloc functions: we use them to allocate DMA resources, and the addresses are global across the hosts,
>    so when the NVMe controller transmits DMA packets, the PCIe interconnect fabric can route the DMA to the
>    correct host by its DMA address;
> 
> 2. Private MSI-X enable functions: the default MSI-X setup programs the local MSI-X address, so we need to update it
>    to a global DMA address, which just adds a global address offset;
> 
> 3. Use the otherwise unused NVMe BAR4 as the communication channel to map a host's nvme admin queue to the manager OS, so we can pass
>    the host admin command to the manager OS, and the manager OS is responsible for delivering the admin command to the physical
>    nvme controller;
> 
> 4. Request nvme IO queues from the manager OS, which returns the free IO queue ids; the host then requests creation of the allocated IO queues.
> 
> 
> 
> [yijing at localhost linux-3.10-new]$ diff drivers/block/nvme-host.c drivers/block/nvme-core.c
> 44d43
> < #include <linux/msi.h>
> 61d59
> < static dma_addr_t plx_dma_addr_offset = 0;
> 63,64d60
> < static void nvme_post_admin_cmd(struct nvme_dev *dev,
> < 		struct nvme_command *c);
> 90,163d85
> < static void * nvme_dma_alloc_coherent(struct device *dev,
> < 		size_t size, dma_addr_t *dma_handle, gfp_t gfp)
> < {
> < 	void *mem;
> <
> < 	mem = dma_alloc_coherent(dev, size, dma_handle, gfp);
> < 	/* Add dma address offset for nvme device in the host side */
> < 	*dma_handle += plx_dma_addr_offset;
> < 	return mem;
> < }
> <
> < static void nvme_dma_free_coherent(struct device *dev,
> < 		size_t size, void *vaddr, dma_addr_t bus)
> < {
> < 	dma_free_coherent(dev, size, vaddr, bus - plx_dma_addr_offset);
> < }
> <
> < static int nvme_dma_map_sg(struct device *dev, struct scatterlist *sg,
> < 		int nents, enum dma_data_direction dir)
> < {
> < 	int result, i;
> < 	struct scatterlist *s;
> < 	
> < 	result = dma_map_sg(dev, sg, nents, dir);
> < 	for_each_sg(sg, s, nents, i)
> < 		s->dma_address += plx_dma_addr_offset;
> < 	
> < 	return result;
> < }
> <
> < static void nvme_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
> < 		int nents, enum dma_data_direction dir)
> < {
> < 	int i;
> < 	struct scatterlist *s;
> < 	
> < 	for_each_sg(sg, s, nents, i)
> < 		s->dma_address -= plx_dma_addr_offset;
> <
> < 	dma_unmap_sg(dev, sg, nents, dir);
> < }
> <
> < /* NVMe private MSI interfaces */
> < static int nvme_enable_msix(struct nvme_dev *dev, int nvec)
> < {
> < 	int ret;
> < 	void __iomem *base;
> < 	struct msi_desc *entry;
> <
> < 	ret = pci_enable_msix(dev->pci_dev, dev->entry, nvec);
> < 	if (!ret) {
> < 		list_for_each_entry(entry, &dev->pci_dev->msi_list, list) {
> < 			base = entry->mask_base +
> < 				entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
> <
> < 			entry->msg.address_lo += plx_dma_addr_offset & 0xffffffff;
> < 			entry->msg.address_hi += plx_dma_addr_offset >> 32;
> <
> < 			mask_msix_entry(dev->pci_dev, entry->msi_attrib.entry_nr,
> < 					entry->mask_base + entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE
> < 					+ PCI_MSIX_ENTRY_VECTOR_CTRL, 1);
> < 			/* Flush the updated MSI address */
> < 			writel(entry->msg.address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
> < 			writel(entry->msg.address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
> <
> < 			mask_msix_entry(dev->pci_dev, entry->msi_attrib.entry_nr,
> < 					entry->mask_base + entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE
> < 					+ PCI_MSIX_ENTRY_VECTOR_CTRL, 0);
> < 		}	
> < 	}
> <
> < 	return ret;
> < }
> <
> 289d210
> < 	pr_info("%s: nvmeq %p, free cmdid %d\n", __func__, nvmeq, cmdid);
> 308c229
> < 	return dev->queues[get_cpu()];
> ---
>> 	return dev->queues[get_cpu() + 1];
> 398c319
> < 		nvme_dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
> ---
>> 		dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
> 630c551
> < 	if (nvme_dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
> ---
>> 	if (dma_map_sg(nvmeq->q_dmadev, iod->sg, iod->nents, dma_dir) == 0)
> 699d619
> < 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
> 734d653
> < 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
> 832,837d750
> < 		if (!fn) {
> < 			pr_err("%s: nvmeq %p, result %d, sq_head %d,"
> < 					"sq_id %d, command id %d, status %d\n",
> < 					__func__, nvmeq, cqe.result, cqe.sq_head,
> < 					cqe.sq_id, cqe.command_id, cqe.status);
> < 		}
> 915d827
> < 	pr_info("%s: nvmeq %p, alloc cmdid %d\n", __func__, nvmeq, cmdid);
> 941c853
> < 	u32 val;
> ---
>> 	int status;
> 948,950c860,862
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> ---
>> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
>> 	if (status)
>> 		return -EIO;
> 956a869
>> 	int status;
> 959d871
> < 	u32 val;
> 968,972c880,883
> < 	pr_debug("%s: cq qid %d, prp1 0x%llx, vector %d\n",
> < 			__func__, qid, nvmeq->cq_dma_addr, nvmeq->cq_vector);
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> ---
>>
>> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
>> 	if (status)
>> 		return -EIO;
> 978a890
>> 	int status;
> 981d892
> < 	u32 val;
> 991,995c902,904
> < 	pr_debug("%s: sq qid %d, prp1 0x%llx\n",
> < 			__func__, qid, nvmeq->sq_dma_addr);
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> ---
>> 	status = nvme_submit_admin_cmd(dev, &c, NULL);
>> 	if (status)
>> 		return -EIO;
> 1009,1083d917
> < static void nvme_post_admin_cmd(struct nvme_dev *dev,
> < 		struct nvme_command *c)
> < {
> < 	int i;
> < 	u32 *addr = (u32 *)c;
> <
> < 	/* nvme admin command is always 64 bytes */
> < 	for (i = 0; i < 16; i++) {
> < 		writel(*addr, dev->bar4 + 8 + i * 4);
> < 		addr++;
> < 	}
> < 	/* Tag it's a nvme admin command */
> < 	writel(NVME_ADMIN_CMD, dev->bar4 + 4);
> < }
> <
> < static int nvme_recv_data_back(struct nvme_dev *dev,
> < 		void *mem, size_t size)
> < {
> < 	u32 *addr = (u32 *)mem;
> < 	int count = 30, i;
> < 	u32 val;
> <
> < 	while (count--) {
> < 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
> < 				NVME_RETURN_READY) {
> < 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
> <
> < 			val = readl(dev->bar4 + NVME_RETURN_OFFSET + 4);
> < 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET + 4);
> < 			if (val) {
> < 				/* admin process fail */
> < 				dev_warn(&dev->pci_dev->dev,
> < 						"admin command fail\n");
> < 				return val;
> < 			}
> <
> < 			for (i = 0; i < size; i += 4) {
> < 				*addr = readl(dev->bar4 + NVME_RETURN_OFFSET + 8 + i);
> < 				addr++;
> < 			}
> < 			break;
> < 		}
> < 		msleep(10);
> < 	}
> < 	
> < 	if (!count) {
> < 		dev_warn(&dev->pci_dev->dev,
> < 				"recv admin command data back timeout\n");
> < 		return -1;
> < 	}
> <
> < 	return 0;
> < }
> <
> < int nvme_identify_plx(struct nvme_dev *dev, unsigned nsid, unsigned cns,
> < 							void *mem)
> < {
> < 	struct nvme_command c;
> < 	int val;
> <
> < 	memset(&c, 0, sizeof(c));
> < 	c.identify.opcode = nvme_admin_identify;
> < 	c.identify.nsid = cpu_to_le32(nsid);
> < 	/* prp1 is not necessary, it will be replaced
> < 	 * with MCPU dma address in PLX MGR
> < 	 */
> < 	c.identify.prp1 = cpu_to_le64(0x12345678);
> < 	c.identify.cns = cpu_to_le32(cns);
> < 	
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> < 	return nvme_recv_data_back(dev, mem, 4096);
> < }
> <
> 1112,1133d945
> < int nvme_get_features_plx(struct nvme_dev *dev, unsigned fid, unsigned nsid,
> < 					void *mem, u32 *result)
> < {
> < 	struct nvme_command c;
> < 	int val;
> <
> < 	memset(&c, 0, sizeof(c));
> < 	c.features.opcode = nvme_admin_get_features;
> < 	c.features.nsid = cpu_to_le32(nsid);
> < 	/* prp1 is not necessary, it will be replaced
> < 	 * with MCPU dma address in PLX MGR, so
> < 	 * 0x12345678 is meaningless here.
> < 	 */
> < 	c.features.prp1 = cpu_to_le64(0x12345678);
> < 	c.features.fid = cpu_to_le32(fid);
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> <
> < 	return nvme_recv_data_back(dev, mem, 4096);
> < }
> <
> 1138d949
> < 	u32 val;
> 1146,1150c957
> < 	nvme_post_admin_cmd(dev, &c);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> <
> < 	return nvme_recv_data_back(dev, result, 4);
> ---
>> 	return nvme_submit_admin_cmd(dev, &c, result);
> 1184c991
> < 	nvme_dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> ---
>> 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> 1186c993
> < 	nvme_dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> ---
>> 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> 1194c1001
> < 	int vector = dev->entry[nvmeq->cq_vector - dev->first + 1].vector;
> ---
>> 	int vector = dev->entry[nvmeq->cq_vector].vector;
> 1207,1209c1014,1018
> < 	/* Hosts don't have an admin queue, so IO queue ids start from 0 */
> < 	adapter_delete_sq(dev, qid + dev->first);
> < 	adapter_delete_cq(dev, qid + dev->first);
> ---
>> 	/* Don't tell the adapter to delete the admin queue */
>> 	if (qid) {
>> 		adapter_delete_sq(dev, qid);
>> 		adapter_delete_cq(dev, qid);
>> 	}
> 1224c1033
> < 	nvmeq->cqes = nvme_dma_alloc_coherent(dmadev, CQ_SIZE(depth),
> ---
>> 	nvmeq->cqes = dma_alloc_coherent(dmadev, CQ_SIZE(depth),
> 1230c1039
> < 	nvmeq->sq_cmds = nvme_dma_alloc_coherent(dmadev, SQ_SIZE(depth),
> ---
>> 	nvmeq->sq_cmds = dma_alloc_coherent(dmadev, SQ_SIZE(depth),
> 1250c1059
> < 	nvme_dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
> ---
>> 	dma_free_coherent(dmadev, CQ_SIZE(depth), (void *)nvmeq->cqes,
> 1265,1266c1074,1075
> < 	return request_irq(dev->entry[nvmeq->cq_vector - dev->first + 1].vector,
> < 			nvme_irq, IRQF_DISABLED | IRQF_SHARED, name, nvmeq);
> ---
>> 	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
>> 				IRQF_DISABLED | IRQF_SHARED, name, nvmeq);
> 1297c1106
> < 	nvme_dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> ---
>> 	dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
> 1299c1108
> < 	nvme_dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> ---
>> 	dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
> 1305d1113
> < #if 0
> 1346c1154,1200
> < #endif
> ---
>>
>> static int nvme_configure_admin_queue(struct nvme_dev *dev)
>> {
>> 	int result;
>> 	u32 aqa;
>> 	u64 cap = readq(&dev->bar->cap);
>> 	struct nvme_queue *nvmeq;
>>
>> 	dev->dbs = ((void __iomem *)dev->bar) + 4096;
>> 	dev->db_stride = NVME_CAP_STRIDE(cap);
>>
>> 	result = nvme_disable_ctrl(dev, cap);
>> 	if (result < 0)
>> 		return result;
>>
>> 	nvmeq = nvme_alloc_queue(dev, 0, 64, 0);
>> 	if (!nvmeq)
>> 		return -ENOMEM;
>>
>> 	aqa = nvmeq->q_depth - 1;
>> 	aqa |= aqa << 16;
>>
>> 	dev->ctrl_config = NVME_CC_ENABLE | NVME_CC_CSS_NVM;
>> 	dev->ctrl_config |= (PAGE_SHIFT - 12) << NVME_CC_MPS_SHIFT;
>> 	dev->ctrl_config |= NVME_CC_ARB_RR | NVME_CC_SHN_NONE;
>> 	dev->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
>>
>> 	writel(aqa, &dev->bar->aqa);
>> 	writeq(nvmeq->sq_dma_addr, &dev->bar->asq);
>> 	writeq(nvmeq->cq_dma_addr, &dev->bar->acq);
>> 	writel(dev->ctrl_config, &dev->bar->cc);
>>
>> 	result = nvme_enable_ctrl(dev, cap);
>> 	if (result)
>> 		goto free_q;
>>
>> 	result = queue_request_irq(dev, nvmeq, "nvme admin");
>> 	if (result)
>> 		goto free_q;
>>
>> 	dev->queues[0] = nvmeq;
>> 	return result;
>>
>>  free_q:
>> 	nvme_free_queue_mem(nvmeq);
>> 	return result;
>> }
> 1388c1242
> < 	nents = nvme_dma_map_sg(&dev->pci_dev->dev, sg, count,
> ---
>> 	nents = dma_map_sg(&dev->pci_dev->dev, sg, count,
> 1410c1264
> < 	nvme_dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
> ---
>> 	dma_unmap_sg(&dev->pci_dev->dev, iod->sg, iod->nents,
> 1470c1324
> < 		meta_mem = nvme_dma_alloc_coherent(&dev->pci_dev->dev, meta_len,
> ---
>> 		meta_mem = dma_alloc_coherent(&dev->pci_dev->dev, meta_len,
> 1522c1376
> < 		nvme_dma_free_coherent(&dev->pci_dev->dev, meta_len, meta_mem,
> ---
>> 		dma_free_coherent(&dev->pci_dev->dev, meta_len, meta_mem,
> 1771c1625
> < static int get_queue_info(struct nvme_dev *dev, int *start, int count)
> ---
>> static int set_queue_count(struct nvme_dev *dev, int count)
> 1773,1788c1627,1629
> < 	u32 result = 0, val;
> < 	int c = 10;
> <
> < 	writel(NVME_IOQ_INFO, dev->bar4 + 4);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> <
> < 	while (c--) {
> < 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
> < 				NVME_RETURN_READY) {
> < 			result = readl(dev->bar4 + NVME_RETURN_OFFSET + 8);
> < 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
> < 			break;
> < 		}
> < 		msleep(10);
> < 	}
> ---
>> 	int status;
>> 	u32 result;
>> 	u32 q_count = (count - 1) | ((count - 1) << 16);
> 1790,1795c1631,1635
> < 	/*
> < 	 * The MCPU returns the first IO queue id in the high 16 bits
> < 	 * and the number of IO queues in the low 16 bits.
> < 	 */
> < 	*start = result >> 16;
> < 	return (result & 0xffff);
> ---
>> 	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
>> 								&result);
>> 	if (status)
>> 		return -EIO;
>> 	return min(result & 0xffff, result >> 16) + 1;
> 1801,1802c1641
> < 	int result, first, cpu, i, nr_io_queues;
> < 	int db_bar_size, q_depth;
> ---
>> 	int result, cpu, i, nr_io_queues, db_bar_size, q_depth, q_count;
> 1805,1806c1644
> < 	/* "first" is the first io queue id allocated */
> < 	result = get_queue_info(dev, &first, nr_io_queues);
> ---
>> 	result = set_queue_count(dev, nr_io_queues);
> 1809,1810d1646
> < 	if (result == 0 || first == 0)
> < 		return -EPERM;
> 1813,1815c1649,1654
> < 	
> < 	dev->first = first;
> < 	db_bar_size = 4096 + ((first + nr_io_queues) << (dev->db_stride + 3));
> ---
>>
>> 	q_count = nr_io_queues;
>> 	/* Deregister the admin queue's interrupt */
>> 	free_irq(dev->entry[0].vector, dev->queues[0]);
>>
>> 	db_bar_size = 4096 + ((nr_io_queues + 1) << (dev->db_stride + 3));
> 1823,1827d1661
> < 	/*
> < 	 * The admin queue and the first IO queue share an MSI-X irq
> < 	 * on the MCPU, so an IO queue with id x uses MSI-X entry
> < 	 * x - 1.
> < 	 */
> 1829c1663
> < 		dev->entry[i].entry = i + first - 1;
> ---
>> 		dev->entry[i].entry = i;
> 1831c1665
> < 		result = nvme_enable_msix(dev, nr_io_queues);
> ---
>> 		result = pci_enable_msix(pdev, dev->entry, nr_io_queues);
> 1842a1677,1697
>> 	if (nr_io_queues == 0) {
>> 		nr_io_queues = q_count;
>> 		for (;;) {
>> 			result = pci_enable_msi_block(pdev, nr_io_queues);
>> 			if (result == 0) {
>> 				for (i = 0; i < nr_io_queues; i++)
>> 					dev->entry[i].vector = i + pdev->irq;
>> 				break;
>> 			} else if (result > 0) {
>> 				nr_io_queues = result;
>> 				continue;
>> 			} else {
>> 				nr_io_queues = 1;
>> 				break;
>> 			}
>> 		}
>> 	}
>>
>> 	result = queue_request_irq(dev, dev->queues[0], "nvme admin");
>> 	/* XXX: handle failure here */
>>
> 1852,1855c1707,1709
> < 		dev->queues[i] = nvme_create_queue(dev, i + first, q_depth,
> < 				i + first -1);
> < 		if (IS_ERR(dev->queues[i]))
> < 			return PTR_ERR(dev->queues[i]);
> ---
>> 		dev->queues[i + 1] = nvme_create_queue(dev, i + 1, q_depth, i);
>> 		if (IS_ERR(dev->queues[i + 1]))
>> 			return PTR_ERR(dev->queues[i + 1]);
> 1860,1861c1714,1715
> < 		int target = i % rounddown_pow_of_two(dev->queue_count);
> < 		dev->queues[i] = dev->queues[target];
> ---
>> 		int target = i % rounddown_pow_of_two(dev->queue_count - 1);
>> 		dev->queues[i + 1] = dev->queues[target + 1];
> 1887a1742
>> 	dma_addr_t dma_addr;
> 1894c1749,1750
> < 	mem = kzalloc(8192, GFP_KERNEL);
> ---
>> 	mem = dma_alloc_coherent(&dev->pci_dev->dev, 8192, &dma_addr,
>> 								GFP_KERNEL);
> 1898c1754
> < 	res = nvme_identify_plx(dev, 0, 1, mem);
> ---
>> 	res = nvme_identify(dev, 0, 1, dma_addr);
> 1918c1774
> < 		res = nvme_identify_plx(dev, i, 0, mem);
> ---
>> 		res = nvme_identify(dev, i, 0, dma_addr);
> 1925,1926c1781,1782
> < 		res = nvme_get_features_plx(dev, NVME_FEAT_LBA_RANGE, i,
> < 							mem + 4096, NULL);
> ---
>> 		res = nvme_get_features(dev, NVME_FEAT_LBA_RANGE, i,
>> 							dma_addr + 4096, NULL);
> 1939c1795
> < 	kfree(mem);
> ---
>> 	dma_free_coherent(&dev->pci_dev->dev, 8192, mem, dma_addr);
> 2049,2074d1904
> < static void nvme_update_dma_addr_offset(struct nvme_dev *dev)
> < {
> < 	int val, count = 10;
> < 	u64	result = 0;
> <
> < 	if (plx_dma_addr_offset)
> < 		return;
> <
> < 	writel(NVME_DMA_ADDR_OFFSET, dev->bar4 + 4);
> < 	writel(NVME_DATA_VALID, dev->bar4);
> < 	pci_read_config_dword(dev->pci_dev, 0x100, &val);
> < 	while (count--) {
> < 		if (readl(dev->bar4 + NVME_RETURN_OFFSET) ==
> < 				NVME_RETURN_READY) {
> <
> < 			result = readq(dev->bar4 + NVME_RETURN_OFFSET + 8);
> < 			writel(0x0, dev->bar4 + NVME_RETURN_OFFSET);
> < 			break;
> < 		}
> < 		msleep(10);
> < 	}
> <
> < 	dev_info(&dev->pci_dev->dev, "PLX dma addr offset: 0x%llx\n", result);
> < 	plx_dma_addr_offset = result;
> < }
> <
> 2098,2112c1928
> < 	struct pci_dev *tmp_dev = NULL;
> < 	u64 cap;
> < 	int flag = 0;	
> <
> < 	pdev->nvme = 1;
> < 	for_each_pci_dev(tmp_dev){
> < 		if(tmp_dev->device == 0x1009){
> < 			flag = 1;
> < 			break;
> < 		}
> < 	}
> < 	if(flag)
> < 		return 0;
> < 	if(pdev->bus->self->device != 0x9797)
> < 		return 0;	
> ---
>>
> 2120c1936
> < 	dev->queues = kcalloc(num_possible_cpus(), sizeof(void *),
> ---
>> 	dev->queues = kcalloc(num_possible_cpus() + 1, sizeof(void *),
> 2146a1963,1964
>> 	dev->entry[0].vector = pdev->irq;
>>
> 2157,2159c1975,1976
> < 	dev->bar4 = ioremap(pci_resource_start(pdev, 4), 16 *1024);
> < 	if (!dev->bar4) {
> < 		result = -ENOMEM;
> ---
>> 	result = nvme_configure_admin_queue(dev);
>> 	if (result)
> 2161,2167c1978
> < 	}
> <
> < 	nvme_update_dma_addr_offset(dev);
> <
> < 	cap = readq(&dev->bar->cap);
> < 	dev->dbs = ((void __iomem *)dev->bar) + 4096;
> < 	dev->db_stride = NVME_CAP_STRIDE(cap);
> ---
>> 	dev->queue_count++;
> 2199d2009
> < 	iounmap(dev->bar4);
> 
> 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-07-08  8:49 UTC | newest]

Thread overview: 14+ messages
2015-06-08 16:08 [PATCH 0/3] NVMe: Initialization error handling fixups Keith Busch
2015-06-08 16:08 ` [PATCH 1/3] NVMe: Fix device cleanup on initialization failure Keith Busch
2015-06-08 16:08 ` [PATCH 2/3] NVMe: Don't use fake status on cancelled command Keith Busch
2015-06-11 10:40   ` Christoph Hellwig
2015-06-11 14:15     ` Keith Busch
2015-06-11 15:23       ` Christoph Hellwig
     [not found]         ` <55935989.70809@huawei.com>
2015-07-01 16:17           ` Question about NVMe share I/O Keith Busch
2015-07-01 16:45             ` James R. Bergsten
2015-07-02  7:11             ` dingxiang
2015-07-02 12:42             ` Yijing Wang
2015-07-02 14:42               ` Keith Busch
2015-07-03  1:24                 ` Yijing Wang
2015-07-08  8:49                   ` dingxiang
2015-06-08 16:08 ` [PATCH 3/3] NVMe: Unify controller probe and resume Keith Busch
