* [PATCH 0/7] few nvme-rdma fixes for 4.18
@ 2018-06-19 12:34 Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller Sagi Grimberg
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


Patches 1-5 fix some reported allocation/free glitches. Patches
6-7 centralize some setup/teardown logic so that fixes can be
applied in a single location (these two can probably go to
4.19).

Israel Rukshin (1):
  nvme-rdma: Fix command completion race at error recovery

Sagi Grimberg (6):
  nvme-rdma: fix possible double free condition when failing to create a
    controller
  nvme-rdma: fix possible free non-allocated async event buffer
  nvme-rdma: unquiesce queues when deleting the controller
  nvme-rdma: don't override opts->queue_size
  nvme-rdma: centralize controller setup sequence
  nvme-rdma: centralize admin/io queue teardown sequence

 drivers/nvme/host/rdma.c | 242 +++++++++++++++++++++--------------------------
 1 file changed, 109 insertions(+), 133 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-20  9:06   ` Max Gurtovoy
  2018-06-19 12:34 ` [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer Sagi Grimberg
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


Failures after nvme_init_ctrl() defer resource cleanup to .free_ctrl,
which runs when the controller reference is released, so we must not
free the controller queues directly on those failure paths.

Fix this by moving the controller queue allocation before controller
initialization: free the queues directly on failures that occur before
initialization, and skip the direct free on failures that occur after it.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c9424da0d23e..99d213ea95da 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1932,11 +1932,6 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 		goto out_free_ctrl;
 	}
 
-	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
-				0 /* no quirks, we're perfect! */);
-	if (ret)
-		goto out_free_ctrl;
-
 	INIT_DELAYED_WORK(&ctrl->reconnect_work,
 			nvme_rdma_reconnect_ctrl_work);
 	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
@@ -1950,14 +1945,19 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
 				GFP_KERNEL);
 	if (!ctrl->queues)
-		goto out_uninit_ctrl;
+		goto out_free_ctrl;
+
+	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
+				0 /* no quirks, we're perfect! */);
+	if (ret)
+		goto out_kfree_queues;
 
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING);
 	WARN_ON_ONCE(!changed);
 
 	ret = nvme_rdma_configure_admin_queue(ctrl, true);
 	if (ret)
-		goto out_kfree_queues;
+		goto out_uninit_ctrl;
 
 	/* sanity check icdoff */
 	if (ctrl->ctrl.icdoff) {
@@ -2014,14 +2014,14 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 
 out_remove_admin_queue:
 	nvme_rdma_destroy_admin_queue(ctrl, true);
-out_kfree_queues:
-	kfree(ctrl->queues);
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&ctrl->ctrl);
 	nvme_put_ctrl(&ctrl->ctrl);
 	if (ret > 0)
 		ret = -EIO;
 	return ERR_PTR(ret);
+out_kfree_queues:
+	kfree(ctrl->queues);
 out_free_ctrl:
 	kfree(ctrl);
 	return ERR_PTR(ret);
-- 
2.14.1


* [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-20  8:31   ` Christoph Hellwig
  2018-06-19 12:34 ` [PATCH 3/7] nvme-rdma: Fix command completion race at error recovery Sagi Grimberg
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


If nvme_rdma_configure_admin_queue fails before the async event buffer
was allocated, we falsely free it because nvme_rdma_free_queue frees it
unconditionally. Fix this by allocating the buffer right after
nvme_rdma_alloc_queue and freeing it right before nvme_rdma_free_queue,
maintaining an orderly reverse cleanup sequence.

Reported-by: Israel Rukshin <israelr at mellanox.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 99d213ea95da..cebe527ec6cd 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -560,12 +560,6 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
 	if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
 		return;
 
-	if (nvme_rdma_queue_idx(queue) == 0) {
-		nvme_rdma_free_qe(queue->device->dev,
-			&queue->ctrl->async_event_sqe,
-			sizeof(struct nvme_command), DMA_TO_DEVICE);
-	}
-
 	nvme_rdma_destroy_queue_ib(queue);
 	rdma_destroy_id(queue->cm_id);
 }
@@ -739,6 +733,8 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		blk_cleanup_queue(ctrl->ctrl.admin_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
 	}
+	nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
+		sizeof(struct nvme_command), DMA_TO_DEVICE);
 	nvme_rdma_free_queue(&ctrl->queues[0]);
 }
 
@@ -755,11 +751,16 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 
 	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
 
+	error = nvme_rdma_alloc_qe(ctrl->device->dev, &ctrl->async_event_sqe,
+			sizeof(struct nvme_command), DMA_TO_DEVICE);
+	if (error)
+		goto out_free_queue;
+
 	if (new) {
 		ctrl->ctrl.admin_tagset = nvme_rdma_alloc_tagset(&ctrl->ctrl, true);
 		if (IS_ERR(ctrl->ctrl.admin_tagset)) {
 			error = PTR_ERR(ctrl->ctrl.admin_tagset);
-			goto out_free_queue;
+			goto out_free_async_qe;
 		}
 
 		ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set);
@@ -795,12 +796,6 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	if (error)
 		goto out_stop_queue;
 
-	error = nvme_rdma_alloc_qe(ctrl->queues[0].device->dev,
-			&ctrl->async_event_sqe, sizeof(struct nvme_command),
-			DMA_TO_DEVICE);
-	if (error)
-		goto out_stop_queue;
-
 	return 0;
 
 out_stop_queue:
@@ -811,6 +806,9 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 out_free_tagset:
 	if (new)
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
+out_free_async_qe:
+	nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
+		sizeof(struct nvme_command), DMA_TO_DEVICE);
 out_free_queue:
 	nvme_rdma_free_queue(&ctrl->queues[0]);
 	return error;
-- 
2.14.1


* [PATCH 3/7] nvme-rdma: Fix command completion race at error recovery
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller Sagi Grimberg
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


From: Israel Rukshin <israelr@mellanox.com>

The race is between completing the request in the error recovery work
and RDMA completions. If we cancel the request before the successful
RDMA completion arrives, we get a NULL dereference of the request MR
in nvme_rdma_process_nvme_rsp().

When canceling the request we return its MR to the MR pool (setting mr
to NULL) and also unmap its data. Canceling requests while the RDMA
queues are still active is not safe, because successful RDMA completions
can still arrive and use the mr pointer, which may already be NULL.
Completing the request too soon may also lead to DMA to/from user
buffers that have already been unmapped.

Fix the race by draining the QP before
starting the command cancellation mechanism.

Signed-off-by: Israel Rukshin <israelr at mellanox.com>
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index cebe527ec6cd..15e897ac2506 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -728,7 +728,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.admin_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
@@ -817,7 +816,6 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 static void nvme_rdma_destroy_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_io_queues(ctrl);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
@@ -947,6 +945,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	return;
 
 destroy_admin:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
@@ -963,12 +962,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, false);
 	}
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
@@ -1734,6 +1735,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, shutdown);
@@ -1745,6 +1747,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
@@ -2011,6 +2014,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	return &ctrl->ctrl;
 
 out_remove_admin_queue:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, true);
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&ctrl->ctrl);
-- 
2.14.1


* [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
                   ` (2 preceding siblings ...)
  2018-06-19 12:34 ` [PATCH 3/7] nvme-rdma: Fix command completion race at error recovery Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-21  9:57   ` Max Gurtovoy
  2018-06-19 12:34 ` [PATCH 5/7] nvme-rdma: don't override opts->queue_size Sagi Grimberg
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


If the controller is going away, we need to unquiesce the io
queues so that all pending requests can fail gracefully before
moving forward with controller deletion.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 15e897ac2506..edf63cd106ff 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1741,10 +1741,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 		nvme_rdma_destroy_io_queues(ctrl, shutdown);
 	}
 
-	if (shutdown)
+	if (shutdown) {
+		nvme_start_queues(&ctrl->ctrl);
 		nvme_shutdown_ctrl(&ctrl->ctrl);
-	else
+	} else {
 		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
+	}
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
 	nvme_rdma_stop_queue(&ctrl->queues[0]);
-- 
2.14.1


* [PATCH 5/7] nvme-rdma: don't override opts->queue_size
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
                   ` (3 preceding siblings ...)
  2018-06-19 12:34 ` [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-19 16:56   ` Daniel Verkamp
  2018-06-19 12:34 ` [PATCH 6/7] nvme-rdma: centralize controller setup sequence Sagi Grimberg
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


opts->queue_size is a user argument, and controller limits can
theoretically change over time (across reconnects/resets). Instead,
use the sqsize controller attribute to check queue depth boundaries
and use it for the tagset allocation.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index edf63cd106ff..9dc1c37b5f65 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -692,7 +692,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set = &ctrl->tag_set;
 		memset(set, 0, sizeof(*set));
 		set->ops = &nvme_rdma_mq_ops;
-		set->queue_depth = nctrl->opts->queue_size;
+		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = 1; /* fabric connect */
 		set->numa_node = NUMA_NO_NODE;
 		set->flags = BLK_MQ_F_SHOULD_MERGE;
@@ -1977,20 +1977,22 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 		goto out_remove_admin_queue;
 	}
 
-	if (opts->queue_size > ctrl->ctrl.maxcmd) {
-		/* warn if maxcmd is lower than queue_size */
-		dev_warn(ctrl->ctrl.device,
-			"queue_size %zu > ctrl maxcmd %u, clamping down\n",
-			opts->queue_size, ctrl->ctrl.maxcmd);
-		opts->queue_size = ctrl->ctrl.maxcmd;
-	}
-
 	if (opts->queue_size > ctrl->ctrl.sqsize + 1) {
-		/* warn if sqsize is lower than queue_size */
+		/*
+		 * warn if sqsize is lower than queue_size actual calmping
+		 * will be by using sqsize for the tagset allocation later.
+		 */
 		dev_warn(ctrl->ctrl.device,
 			"queue_size %zu > ctrl sqsize %u, clamping down\n",
 			opts->queue_size, ctrl->ctrl.sqsize + 1);
-		opts->queue_size = ctrl->ctrl.sqsize + 1;
+	}
+
+	if (ctrl->ctrl.sqsize + 1 > ctrl->ctrl.maxcmd) {
+		/* warn if maxcmd is lower than sqsize+1 */
+		dev_warn(ctrl->ctrl.device,
+			"sqsize %u > ctrl maxcmd %u, clamping down\n",
+			ctrl->ctrl.sqsize + 1, ctrl->ctrl.maxcmd);
+		ctrl->ctrl.sqsize = ctrl->ctrl.maxcmd - 1;
 	}
 
 	if (opts->nr_io_queues) {
-- 
2.14.1


* [PATCH 6/7] nvme-rdma: centralize controller setup sequence
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
                   ` (4 preceding siblings ...)
  2018-06-19 12:34 ` [PATCH 5/7] nvme-rdma: don't override opts->queue_size Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-19 12:34 ` [PATCH 7/7] nvme-rdma: centralize admin/io queue teardown sequence Sagi Grimberg
  2018-06-20  8:40 ` [PATCH 0/7] few nvme-rdma fixes for 4.18 Christoph Hellwig
  7 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


Centralize the controller setup sequence into a single routine that
correctly cleans up after failures, instead of duplicating it across
several flows (create, reset, reconnect).

Another thing we gain here is that the sanity/boundary checks now also
run when reconnecting to a dynamic controller.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 137 ++++++++++++++++++++---------------------------
 1 file changed, 58 insertions(+), 79 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9dc1c37b5f65..33add5501ed6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -909,21 +909,47 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
 	}
 }
 
-static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
+static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 {
-	struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
-			struct nvme_rdma_ctrl, reconnect_work);
 	bool changed;
 	int ret;
 
-	++ctrl->ctrl.nr_reconnects;
-
-	ret = nvme_rdma_configure_admin_queue(ctrl, false);
+	ret = nvme_rdma_configure_admin_queue(ctrl, new);
 	if (ret)
-		goto requeue;
+		return ret;
+
+	/* sanity check icdoff */
+	if (ctrl->ctrl.icdoff) {
+		dev_err(ctrl->ctrl.device, "icdoff is not supported!\n");
+		ret = -EINVAL;
+		goto destroy_admin;
+	}
+
+	/* sanity check keyed sgls */
+	if (!(ctrl->ctrl.sgls & (1 << 2))) {
+		dev_err(ctrl->ctrl.device,
+			"Mandatory keyed sgls are not supported!\n");
+		ret = -EINVAL;
+		goto destroy_admin;
+	}
+
+	if (ctrl->ctrl.opts->queue_size > ctrl->ctrl.sqsize + 1) {
+		/* warn if sqsize is lower than queue_size */
+		dev_warn(ctrl->ctrl.device,
+			"queue_size %zu > ctrl sqsize %u, clamping down\n",
+			ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
+	}
+
+	if (ctrl->ctrl.sqsize + 1 > ctrl->ctrl.maxcmd) {
+		/* warn if maxcmd is lower than queue_size */
+		dev_warn(ctrl->ctrl.device,
+			"sqsize %u > ctrl maxcmd %u, clamping down\n",
+			ctrl->ctrl.sqsize + 1, ctrl->ctrl.maxcmd);
+		ctrl->ctrl.sqsize = ctrl->ctrl.maxcmd - 1;
+	}
 
 	if (ctrl->ctrl.queue_count > 1) {
-		ret = nvme_rdma_configure_io_queues(ctrl, false);
+		ret = nvme_rdma_configure_io_queues(ctrl, new);
 		if (ret)
 			goto destroy_admin;
 	}
@@ -932,11 +958,32 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	if (!changed) {
 		/* state change failure is ok if we're in DELETING state */
 		WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING);
-		return;
+		ret = -EINVAL;
+		goto destroy_io;
 	}
 
 	nvme_start_ctrl(&ctrl->ctrl);
 
+	return 0;
+destroy_io:
+	if (ctrl->ctrl.queue_count > 1)
+		nvme_rdma_destroy_io_queues(ctrl, new);
+destroy_admin:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
+	nvme_rdma_destroy_admin_queue(ctrl, new);
+	return ret;
+}
+
+static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
+{
+	struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
+			struct nvme_rdma_ctrl, reconnect_work);
+
+	++ctrl->ctrl.nr_reconnects;
+
+	if (nvme_rdma_setup_ctrl(ctrl, false))
+		goto requeue;
+
 	dev_info(ctrl->ctrl.device, "Successfully reconnected (%d attempts)\n",
 			ctrl->ctrl.nr_reconnects);
 
@@ -944,9 +991,6 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 
 	return;
 
-destroy_admin:
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
-	nvme_rdma_destroy_admin_queue(ctrl, false);
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
 			ctrl->ctrl.nr_reconnects);
@@ -1765,8 +1809,6 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl =
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
-	int ret;
-	bool changed;
 
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
@@ -1777,25 +1819,9 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 		return;
 	}
 
-	ret = nvme_rdma_configure_admin_queue(ctrl, false);
-	if (ret)
+	if (nvme_rdma_setup_ctrl(ctrl, false))
 		goto out_fail;
 
-	if (ctrl->ctrl.queue_count > 1) {
-		ret = nvme_rdma_configure_io_queues(ctrl, false);
-		if (ret)
-			goto out_fail;
-	}
-
-	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
-	if (!changed) {
-		/* state change failure is ok if we're in DELETING state */
-		WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING);
-		return;
-	}
-
-	nvme_start_ctrl(&ctrl->ctrl);
-
 	return;
 
 out_fail:
@@ -1958,52 +1984,10 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING);
 	WARN_ON_ONCE(!changed);
 
-	ret = nvme_rdma_configure_admin_queue(ctrl, true);
+	ret = nvme_rdma_setup_ctrl(ctrl, true);
 	if (ret)
 		goto out_uninit_ctrl;
 
-	/* sanity check icdoff */
-	if (ctrl->ctrl.icdoff) {
-		dev_err(ctrl->ctrl.device, "icdoff is not supported!\n");
-		ret = -EINVAL;
-		goto out_remove_admin_queue;
-	}
-
-	/* sanity check keyed sgls */
-	if (!(ctrl->ctrl.sgls & (1 << 2))) {
-		dev_err(ctrl->ctrl.device,
-			"Mandatory keyed sgls are not supported!\n");
-		ret = -EINVAL;
-		goto out_remove_admin_queue;
-	}
-
-	if (opts->queue_size > ctrl->ctrl.sqsize + 1) {
-		/*
-		 * warn if sqsize is lower than queue_size actual calmping
-		 * will be by using sqsize for the tagset allocation later.
-		 */
-		dev_warn(ctrl->ctrl.device,
-			"queue_size %zu > ctrl sqsize %u, clamping down\n",
-			opts->queue_size, ctrl->ctrl.sqsize + 1);
-	}
-
-	if (ctrl->ctrl.sqsize + 1 > ctrl->ctrl.maxcmd) {
-		/* warn if maxcmd is lower than sqsize+1 */
-		dev_warn(ctrl->ctrl.device,
-			"sqsize %u > ctrl maxcmd %u, clamping down\n",
-			ctrl->ctrl.sqsize + 1, ctrl->ctrl.maxcmd);
-		ctrl->ctrl.sqsize = ctrl->ctrl.maxcmd - 1;
-	}
-
-	if (opts->nr_io_queues) {
-		ret = nvme_rdma_configure_io_queues(ctrl, true);
-		if (ret)
-			goto out_remove_admin_queue;
-	}
-
-	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
-	WARN_ON_ONCE(!changed);
-
 	dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
 		ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
 
@@ -2013,13 +1997,8 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	list_add_tail(&ctrl->list, &nvme_rdma_ctrl_list);
 	mutex_unlock(&nvme_rdma_ctrl_mutex);
 
-	nvme_start_ctrl(&ctrl->ctrl);
-
 	return &ctrl->ctrl;
 
-out_remove_admin_queue:
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
-	nvme_rdma_destroy_admin_queue(ctrl, true);
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&ctrl->ctrl);
 	nvme_put_ctrl(&ctrl->ctrl);
-- 
2.14.1


* [PATCH 7/7] nvme-rdma: centralize admin/io queue teardown sequence
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
                   ` (5 preceding siblings ...)
  2018-06-19 12:34 ` [PATCH 6/7] nvme-rdma: centralize controller setup sequence Sagi Grimberg
@ 2018-06-19 12:34 ` Sagi Grimberg
  2018-06-20  8:40 ` [PATCH 0/7] few nvme-rdma fixes for 4.18 Christoph Hellwig
  7 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-19 12:34 UTC (permalink / raw)


We follow the same queue teardown sequence in delete, reset
and error recovery. Centralize the logic. This patch does
not change any functionality.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 61 +++++++++++++++++++++---------------------------
 1 file changed, 26 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 33add5501ed6..f6d98baaf31f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -865,6 +865,27 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
 	return ret;
 }
 
+static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remove)
+{
+	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
+	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
+				nvme_cancel_request, &ctrl->ctrl);
+	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_destroy_admin_queue(ctrl, remove);
+}
+
+static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl, bool remove)
+{
+	if (ctrl->ctrl.queue_count > 1) {
+		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
+		blk_mq_tagset_busy_iter(&ctrl->tag_set,
+					nvme_cancel_request, &ctrl->ctrl);
+		nvme_rdma_destroy_io_queues(ctrl, remove);
+	}
+}
+
 static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
@@ -1003,27 +1024,10 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 			struct nvme_rdma_ctrl, err_work);
 
 	nvme_stop_keep_alive(&ctrl->ctrl);
-
-	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
-		nvme_rdma_stop_io_queues(ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
-		nvme_rdma_destroy_io_queues(ctrl, false);
-	}
-
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
-	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
-				nvme_cancel_request, &ctrl->ctrl);
-	nvme_rdma_destroy_admin_queue(ctrl, false);
-
-	/*
-	 * queues are not a live anymore, so restart the queues to fail fast
-	 * new IO
-	 */
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_teardown_io_queues(ctrl, false);
+	/* fail fast new IO */
 	nvme_start_queues(&ctrl->ctrl);
+	nvme_rdma_teardown_admin_queue(ctrl, false);
 
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		/* state change failure is ok if we're in DELETING state */
@@ -1777,27 +1781,14 @@ static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
 
 static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 {
-	if (ctrl->ctrl.queue_count > 1) {
-		nvme_stop_queues(&ctrl->ctrl);
-		nvme_rdma_stop_io_queues(ctrl);
-		blk_mq_tagset_busy_iter(&ctrl->tag_set,
-					nvme_cancel_request, &ctrl->ctrl);
-		nvme_rdma_destroy_io_queues(ctrl, shutdown);
-	}
-
+	nvme_rdma_teardown_io_queues(ctrl, shutdown);
 	if (shutdown) {
 		nvme_start_queues(&ctrl->ctrl);
 		nvme_shutdown_ctrl(&ctrl->ctrl);
 	} else {
 		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
 	}
-
-	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
-	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
-				nvme_cancel_request, &ctrl->ctrl);
-	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
-	nvme_rdma_destroy_admin_queue(ctrl, shutdown);
+	nvme_rdma_teardown_admin_queue(ctrl, shutdown);
 }
 
 static void nvme_rdma_delete_ctrl(struct nvme_ctrl *ctrl)
-- 
2.14.1


* [PATCH 5/7] nvme-rdma: don't override opts->queue_size
  2018-06-19 12:34 ` [PATCH 5/7] nvme-rdma: don't override opts->queue_size Sagi Grimberg
@ 2018-06-19 16:56   ` Daniel Verkamp
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Verkamp @ 2018-06-19 16:56 UTC (permalink / raw)


On 06/19/2018 05:34 AM, Sagi Grimberg wrote:
> That is user argument, and theoretically controller limits can
> change over time (over reconnects/resets). Instead, use sqsize
> controller attribute to check queue depth boundaries and use it
> to the tagset allocation.
[...]
>  	if (opts->queue_size > ctrl->ctrl.sqsize + 1) {
> -		/* warn if sqsize is lower than queue_size */
> +		/*
> +		 * warn if sqsize is lower than queue_size actual calmping
> +		 * will be by using sqsize for the tagset allocation later.
> +		 */

This comment is a bit hard to follow; I think it's meant to say
"clamping", and it could probably use a semicolon or period before "actual".

Thanks,
-- Daniel


* [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-19 12:34 ` [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer Sagi Grimberg
@ 2018-06-20  8:31   ` Christoph Hellwig
  2018-06-20 12:02     ` Max Gurtovoy
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2018-06-20  8:31 UTC (permalink / raw)


On Tue, Jun 19, 2018@03:34:10PM +0300, Sagi Grimberg wrote:
> If nvme_rdma_configure_admin_queue fails before we allocated
> the async event buffer, we will falsly free it because nvme_rdma_free_queue
> is freeing it. Fix it by allocating the buffer right after nvme_rdma_alloc_queue
> and free it right before nvme_rdma_queue_free to maintain orderly reverse cleanup
> sequence.

This looks much nicer indeed.


* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
                   ` (6 preceding siblings ...)
  2018-06-19 12:34 ` [PATCH 7/7] nvme-rdma: centralize admin/io queue teardown sequence Sagi Grimberg
@ 2018-06-20  8:40 ` Christoph Hellwig
  2018-06-20  8:40   ` Sagi Grimberg
  7 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2018-06-20  8:40 UTC (permalink / raw)


On Tue, Jun 19, 2018@03:34:08PM +0300, Sagi Grimberg wrote:
> Patches 1-5 fixes some allocation/free glitches reported. patches
> 6-7 try to centralize some setup/teardown logic such that fixes
> can be applied in a single location (which can probably go to
> 4.19).

Yeah, the last two are clearly 4.19 material.

But even the ones before are pretty invasive.  I'll apply them to
nvme-4.18 for now, but I feel a bit uneasy.


* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-20  8:40 ` [PATCH 0/7] few nvme-rdma fixes for 4.18 Christoph Hellwig
@ 2018-06-20  8:40   ` Sagi Grimberg
  2018-06-20  8:52     ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-20  8:40 UTC (permalink / raw)



>> Patches 1-5 fixes some allocation/free glitches reported. patches
>> 6-7 try to centralize some setup/teardown logic such that fixes
>> can be applied in a single location (which can probably go to
>> 4.19).
> 
> Yeah, the last two are clearly 4.19 material.
> 
> But even the ones before are pretty invasive.  I'll apply them to
> nvme-4.18 for now, but I feel a bit uneasy.

Let me try to ease your mind ;)

1-3 fix crashes, so they are a must.
4 is needed to prevent a theoretical hang in controller removal, but
one we have never actually encountered.
5 addresses a theoretical case where a dynamic controller changes its
attributes.

So 4-5 can safely be included in 4.19


* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-20  8:40   ` Sagi Grimberg
@ 2018-06-20  8:52     ` Christoph Hellwig
  2018-06-20  8:53       ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2018-06-20  8:52 UTC (permalink / raw)


On Wed, Jun 20, 2018@11:40:22AM +0300, Sagi Grimberg wrote:
> 4 is needed to prevent a theoretical hang in controller removal, but
> never encountered.

This one I'm indeed a little concerned about as I remember various
issues in this area in PCIe.  And RDMA is just different enough that
it doesn't look very similar.

> 5 covers a theoretical case where a dynamic controller changes its
> attributes.

5 looks simple enough (famous last words).

> So 4-5 can safely be included in 4.19

Ok, I've dropped 4 for now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-20  8:52     ` Christoph Hellwig
@ 2018-06-20  8:53       ` Sagi Grimberg
  2018-06-20  9:05         ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-20  8:53 UTC (permalink / raw)



>> 4 is needed to prevent a theoretical hang in controller removal, but
>> never encountered.
> 
> This one I'm indeed a little concerned about as I remember various
> issues in this area in PCIe.  And RDMA is just different enough that
> it doesn't look very similar.

Actually, 4 follows PCIe one to one:
--
         /*
          * The driver will not be starting up queues again if shutting down so
          * must flush all entered requests to their failed completion to avoid
          * deadlocking blk-mq hot-cpu notifier.
          */
         if (shutdown)
                 nvme_start_queues(&dev->ctrl);
--
Nothing about this is PCIe specific, it's about the controller going away
with quiesced IO.

On a side note, if you look at nvme_reset_work and nvme_dev_disable, it's
not that different from what RDMA does, it's just arranged slightly
differently. I still very much think that PCIe, RDMA and FC (and TCP)
can have the same probe, reset and remove sequences. I guess it will just
take some more time to get there.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-20  8:53       ` Sagi Grimberg
@ 2018-06-20  9:05         ` Christoph Hellwig
  2018-06-20  9:08           ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2018-06-20  9:05 UTC (permalink / raw)


On Wed, Jun 20, 2018@11:53:01AM +0300, Sagi Grimberg wrote:
>
>>> 4 is needed to prevent a theoretical hang in controller removal, but
>>> never encountered.
>>
>> This one I'm indeed a little concerned about as I remember various
>> issues in this area in PCIe.  And RDMA is just different enough that
>> it doesn't look very similar.
>
> Actually, 4 follows PCIe one to one:
> --
>         /*
>          * The driver will not be starting up queues again if shutting down so
>          * must flush all entered requests to their failed completion to avoid
>          * deadlocking blk-mq hot-cpu notifier.
>          */
>         if (shutdown)
>                 nvme_start_queues(&dev->ctrl);
> --
> Nothing about this is PCIe specific, it's about the controller going away
> with quiesced IO.

With the difference that PCIe does it at the end of nvme_dev_disable,
long after we've actually disabled or shut down the controller,
while RDMA does it before shutting down the controller.

> On a side note, if you look at nvme_reset_work and nvme_dev_disable, it's
> not that different from what RDMA does, it's just arranged slightly
> differently. I still very much think that PCIe, RDMA and FC (and TCP)
> can have the same probe, reset and remove sequences. I guess it will just
> take some more time to get there.

Sharing more code would be great, especially for managing the block
layer state.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller
  2018-06-19 12:34 ` [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller Sagi Grimberg
@ 2018-06-20  9:06   ` Max Gurtovoy
  2018-06-20 10:41     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Max Gurtovoy @ 2018-06-20  9:06 UTC (permalink / raw)




On 6/19/2018 3:34 PM, Sagi Grimberg wrote:
> Failures after nvme_init_ctrl will defer resource cleanups to .free_ctrl
> when the reference is released, hence we should not free the controller queues
> for these failures.

see below.

> 
> Fix that by moving controller queues allocation before controller initialization
> and correctly freeing them for failures before initialization and skip them
> for failures after initialization.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 18 +++++++++---------
>   1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index c9424da0d23e..99d213ea95da 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1932,11 +1932,6 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>   		goto out_free_ctrl;
>   	}
>   
> -	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
> -				0 /* no quirks, we're perfect! */);
> -	if (ret)
> -		goto out_free_ctrl;
> -
>   	INIT_DELAYED_WORK(&ctrl->reconnect_work,
>   			nvme_rdma_reconnect_ctrl_work);
>   	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
> @@ -1950,14 +1945,19 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>   	ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
>   				GFP_KERNEL);
>   	if (!ctrl->queues)
> -		goto out_uninit_ctrl;
> +		goto out_free_ctrl;
> +
> +	ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
> +				0 /* no quirks, we're perfect! */);
> +	if (ret)
> +		goto out_kfree_queues;
>   
>   	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING);
>   	WARN_ON_ONCE(!changed);
>   
>   	ret = nvme_rdma_configure_admin_queue(ctrl, true);
>   	if (ret)
> -		goto out_kfree_queues;
> +		goto out_uninit_ctrl;
>   
>   	/* sanity check icdoff */
>   	if (ctrl->ctrl.icdoff) {
> @@ -2014,14 +2014,14 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>   
>   out_remove_admin_queue:
>   	nvme_rdma_destroy_admin_queue(ctrl, true);
> -out_kfree_queues:
> -	kfree(ctrl->queues);
>   out_uninit_ctrl:
>   	nvme_uninit_ctrl(&ctrl->ctrl);
>   	nvme_put_ctrl(&ctrl->ctrl);
>   	if (ret > 0)
>   		ret = -EIO;
>   	return ERR_PTR(ret);

Can you explain this error flow? We'll never free the queues since the
list &ctrl->list is empty. I guess a fix to nvme_rdma_free_ctrl is
needed in this case.
Also, when will ret be > 0?


> +out_kfree_queues:
> +	kfree(ctrl->queues);
>   out_free_ctrl:
>   	kfree(ctrl);
>   	return ERR_PTR(ret);
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 0/7] few nvme-rdma fixes for 4.18
  2018-06-20  9:05         ` Christoph Hellwig
@ 2018-06-20  9:08           ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-20  9:08 UTC (permalink / raw)



>>>> 4 is needed to prevent a theoretical hang in controller removal, but
>>>> never encountered.
>>>
>>> This one I'm indeed a little concerned about as I remember various
>>> issues in this area in PCIe.  And RDMA is just different enough that
>>> it doesn't look very similar.
>>
>> Actually, 4 follows PCIe one to one:
>> --
>>          /*
>>           * The driver will not be starting up queues again if shutting down so
>>           * must flush all entered requests to their failed completion to avoid
>>           * deadlocking blk-mq hot-cpu notifier.
>>           */
>>          if (shutdown)
>>                  nvme_start_queues(&dev->ctrl);
>> --
>> Nothing about this is PCIe specific, it's about the controller going away
>> with quiesced IO.
> 
> With the difference that PCIe does it at the end of nvme_dev_disable,
> long after we've actually disabled or shut down the controller,
> while RDMA does it before shutting down the controller.

Well yes, but what matters (to my understanding) is that it happens
after:
1. the request queue was quiesced
2. the queue does not allow requests to come in anymore (!LIVE)
3. inflight requests were cancelled
4. io hw queues were flushed and torn down.

We can do this after we tear down the admin queue as well, but I don't
see a justification to do it.
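For readers following the thread, the ordering argument above can be sketched as a small user-space model. All names here are illustrative stand-ins, not the real nvme-rdma symbols; this is a toy sequencing model, not driver code.

```c
#include <stdbool.h>

/*
 * Toy model of the shutdown ordering described above. The field and
 * function names are hypothetical, chosen to mirror the four steps.
 */
struct queue_model {
	bool quiesced;   /* 1. request queue was quiesced */
	bool live;       /* 2. queue still admits new requests (LIVE) */
	bool inflight;   /* 3. inflight requests not yet cancelled */
	bool hwq_up;     /* 4. io hw queues not yet flushed/torn down */
};

/* Unquiescing (the nvme_start_queues analogue) is only safe after 1-4. */
static bool safe_to_unquiesce(const struct queue_model *q)
{
	return q->quiesced && !q->live && !q->inflight && !q->hwq_up;
}

static void model_shutdown(struct queue_model *q)
{
	q->quiesced = true;   /* 1. blk_mq_quiesce_queue() analogue */
	q->live = false;      /* 2. clear the queue's LIVE flag */
	q->inflight = false;  /* 3. cancel/complete inflight requests */
	q->hwq_up = false;    /* 4. destroy the io hw queues */
}
```

The point of the model is that what matters is the predicate holding before unquiesce, not whether it happens before or after the admin queue teardown.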

>> On a side note, if you look at nvme_reset_work and nvme_dev_disable, it's
>> not that different from what RDMA does, it's just arranged slightly
>> differently. I still very much think that PCIe, RDMA and FC (and TCP)
>> can have the same probe, reset and remove sequences. I guess it will just
>> take some more time to get there.
> 
> Sharing more code would be great, especially for managing the block
> layer state.

We'll get there...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller
  2018-06-20  9:06   ` Max Gurtovoy
@ 2018-06-20 10:41     ` Sagi Grimberg
  2018-06-20 12:29       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-20 10:41 UTC (permalink / raw)



>> @@ -2014,14 +2014,14 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
>>   out_remove_admin_queue:
>>   	nvme_rdma_destroy_admin_queue(ctrl, true);
>> -out_kfree_queues:
>> -	kfree(ctrl->queues);
>>   out_uninit_ctrl:
>>   	nvme_uninit_ctrl(&ctrl->ctrl);
>>   	nvme_put_ctrl(&ctrl->ctrl);
>>   	if (ret > 0)
>>   		ret = -EIO;
>>   	return ERR_PTR(ret);
> 
> Can you explain this error flow? We'll never free the queues since the
> list &ctrl->list is empty. I guess a fix to nvme_rdma_free_ctrl is
> needed in this case.

Good catch! I lost this in the rebase:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 6e45b04ce0f5..4d6953142efd 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -905,9 +905,9 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
         list_del(&ctrl->list);
         mutex_unlock(&nvme_rdma_ctrl_mutex);

-       kfree(ctrl->queues);
         nvmf_free_options(nctrl->opts);
  free_ctrl:
+       kfree(ctrl->queues);
         kfree(ctrl);
  }
--
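The fix above restores the invariant that cleanup labels run in reverse order of setup. A standalone sketch of that goto-unwind pattern follows; `demo_ctrl`, `demo_create_ctrl`, and `fail_init` are hypothetical names invented for illustration, not the driver's symbols.

```c
#include <stdlib.h>

/* Toy controller with one dynamically allocated member. */
struct demo_ctrl {
	int *queues;
};

/*
 * Goto-based unwind: each label frees exactly what was set up before
 * the failing step, in reverse order. fail_init simulates an init step
 * (like nvme_init_ctrl) failing after the queues were allocated, so the
 * queues must be freed on that path.
 */
static struct demo_ctrl *demo_create_ctrl(int nr_queues, int fail_init)
{
	struct demo_ctrl *ctrl;

	ctrl = malloc(sizeof(*ctrl));
	if (!ctrl)
		return NULL;

	ctrl->queues = calloc(nr_queues, sizeof(*ctrl->queues));
	if (!ctrl->queues)
		goto out_free_ctrl;

	if (fail_init)		/* init failed: unwind what exists so far */
		goto out_kfree_queues;

	return ctrl;

out_kfree_queues:
	free(ctrl->queues);
out_free_ctrl:
	free(ctrl);
	return NULL;
}
```

Once initialization hands ownership to a release callback (as nvme_init_ctrl does with .free_ctrl), failures after that point must skip the manual labels, which is exactly why the kfree moved in the hunk above.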

> Also, when ret will be > 0 ?

I guess it can't really happen since nvme_rdma_conn_rejected was added.
I can send a nit fix for that.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-20  8:31   ` Christoph Hellwig
@ 2018-06-20 12:02     ` Max Gurtovoy
  2018-06-24 10:40       ` [Suspected-Phishing]Re: " Max Gurtovoy
  0 siblings, 1 reply; 25+ messages in thread
From: Max Gurtovoy @ 2018-06-20 12:02 UTC (permalink / raw)




On 6/20/2018 11:31 AM, Christoph Hellwig wrote:
> On Tue, Jun 19, 2018@03:34:10PM +0300, Sagi Grimberg wrote:
>> If nvme_rdma_configure_admin_queue fails before we allocated
>> the async event buffer, we will falsely free it because nvme_rdma_free_queue
>> is freeing it. Fix it by allocating the buffer right after nvme_rdma_alloc_queue
>> and freeing it right before nvme_rdma_free_queue to maintain an orderly
>> reverse cleanup sequence.
> 
> This looks much nicer indeed.
> 

Looks good,

Reviewed-by: Max Gurtovoy <maxg at mellanox.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller
  2018-06-20 10:41     ` Sagi Grimberg
@ 2018-06-20 12:29       ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2018-06-20 12:29 UTC (permalink / raw)


On Wed, Jun 20, 2018@01:41:29PM +0300, Sagi Grimberg wrote:
>> Can you explain this error flow? We'll never free the queues since the
>> list &ctrl->list is empty. I guess a fix to nvme_rdma_free_ctrl is needed
>> in this case.
>
> Good catch! I lost this in the rebase:

Thanks, folded.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller
  2018-06-19 12:34 ` [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller Sagi Grimberg
@ 2018-06-21  9:57   ` Max Gurtovoy
  2018-06-24 16:07     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Max Gurtovoy @ 2018-06-21  9:57 UTC (permalink / raw)




On 6/19/2018 3:34 PM, Sagi Grimberg wrote:
> If the controller is going away, we need to unquiesce the io
> queues so that all pending requests can fail gracefully before
> moving forward with controller deletion.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
>   drivers/nvme/host/rdma.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 15e897ac2506..edf63cd106ff 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1741,10 +1741,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>   		nvme_rdma_destroy_io_queues(ctrl, shutdown);
>   	}
>   
> -	if (shutdown)
> +	if (shutdown) {
> +		nvme_start_queues(&ctrl->ctrl);

Any reason we unquiesce the IO queues after destroying them, but
unquiesce the admin queue before the destruction?

>   		nvme_shutdown_ctrl(&ctrl->ctrl);
> -	else
> +	} else {
>   		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
> +	}
>   
>   	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>   	nvme_rdma_stop_queue(&ctrl->queues[0]);
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Suspected-Phishing]Re: [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-20 12:02     ` Max Gurtovoy
@ 2018-06-24 10:40       ` Max Gurtovoy
  2018-06-24 16:00         ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Max Gurtovoy @ 2018-06-24 10:40 UTC (permalink / raw)


Hi all,

On 6/20/2018 3:02 PM, Max Gurtovoy wrote:
> 
> 
> On 6/20/2018 11:31 AM, Christoph Hellwig wrote:
>> On Tue, Jun 19, 2018@03:34:10PM +0300, Sagi Grimberg wrote:
>>> If nvme_rdma_configure_admin_queue fails before we allocated
>>> the async event buffer, we will falsely free it because
>>> nvme_rdma_free_queue is freeing it. Fix it by allocating the buffer
>>> right after nvme_rdma_alloc_queue and freeing it right before
>>> nvme_rdma_free_queue to maintain an orderly reverse cleanup sequence.
>>
>> This looks much nicer indeed.
>>
> 
> Looks good,
> 
> Reviewed-by: Max Gurtovoy <maxg at mellanox.com>

This fix still doesn't address the issue that V1 from Israel did.
We can still call nvme_rdma_destroy_admin_queue a few times (as I
mentioned, nvme_rdma_free_queue is protected by the NVME_RDMA_Q_ALLOCATED
bit, but nvme_rdma_destroy_admin_queue isn't).

I added this check, but I think it is not enough since things can run
concurrently (I still need to do some testing and review):

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5194568..e99ff80 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -168,8 +168,11 @@ static inline size_t nvme_rdma_inline_data_size(struct nvme_rdma_queue *queue)
 static void nvme_rdma_free_qe(struct ib_device *ibdev, struct nvme_rdma_qe *qe,
                size_t capsule_size, enum dma_data_direction dir)
 {
-       ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
-       kfree(qe->data);
+       if (qe->data) {
+               ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
+               kfree(qe->data);
+               qe->data = NULL;
+       }
  }
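The check in the hunk above makes the free path idempotent: clearing the pointer after the first release turns any later call into a no-op. A minimal user-space sketch of the same idiom, with a hypothetical `qe_model` struct standing in for nvme_rdma_qe (no DMA/RDMA involved):

```c
#include <stdlib.h>

/* Hypothetical stand-in for nvme_rdma_qe; only the freed member matters. */
struct qe_model {
	void *data;
};

/*
 * Idempotent free: the NULL guard plus the NULL assignment make a
 * second (or third) invocation safe, which is the property the hunk
 * above adds to nvme_rdma_free_qe.
 */
static void qe_model_free(struct qe_model *qe)
{
	if (qe->data) {
		free(qe->data);
		qe->data = NULL;
	}
}
```

The same defensive shape applies whether the double call comes from an error path or from teardown running twice.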






^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [Suspected-Phishing]Re: [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-24 10:40       ` [Suspected-Phishing]Re: " Max Gurtovoy
@ 2018-06-24 16:00         ` Sagi Grimberg
  2018-06-24 16:19           ` Max Gurtovoy
  0 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-24 16:00 UTC (permalink / raw)



> this fix still doesn't fix the issue that V1 from Israel did.

This fixes the issue that was reported (nvme_rdma_free_queue frees
unallocated async_event_sqe).

> we still can call nvme_rdma_destroy_admin_queue few times (as I 
> mentioned nvme_rdma_free_queue is safe by NVME_RDMA_Q_ALLOCATED bit but 
> nvme_rdma_destroy_admin_queue isn't).

ugh... yes this is not enough to avoid the double free if
reset/reconnect calls nvme_delete_ctrl.

> I added this check but I think this is not enough since things can run 
> concurrently (I still need to make some testing and review):

The destroy routines are serialized by the work flush/cancel so it
should be fine.

> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 5194568..e99ff80 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -168,8 +168,11 @@ static inline size_t nvme_rdma_inline_data_size(struct nvme_rdma_queue *queue)
>  static void nvme_rdma_free_qe(struct ib_device *ibdev, struct nvme_rdma_qe *qe,
>                 size_t capsule_size, enum dma_data_direction dir)
>  {
> -       ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
> -       kfree(qe->data);
> +       if (qe->data) {
> +               ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
> +               kfree(qe->data);
> +               qe->data = NULL;
> +       }
>  }

I'd prefer to limit this check to the async event buffer:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9f5fc7b7fd7f..69c9b7bc0d5b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -732,8 +732,11 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
                blk_cleanup_queue(ctrl->ctrl.admin_q);
                nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
        }
-       nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
-               sizeof(struct nvme_command), DMA_TO_DEVICE);
+       if (ctrl->async_event_sqe.data) {
+               nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
+                       sizeof(struct nvme_command), DMA_TO_DEVICE);
+               ctrl->async_event_sqe.data = NULL;
+       }
         nvme_rdma_free_queue(&ctrl->queues[0]);
  }
--

Does this make the issue go away Max?

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller
  2018-06-21  9:57   ` Max Gurtovoy
@ 2018-06-24 16:07     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2018-06-24 16:07 UTC (permalink / raw)



> On 6/19/2018 3:34 PM, Sagi Grimberg wrote:
>> If the controller is going away, we need to unquiesce the io
>> queues so that all pending requests can fail gracefully before
>> moving forward with controller deletion.
>>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>>   drivers/nvme/host/rdma.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 15e897ac2506..edf63cd106ff 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1741,10 +1741,12 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>>  		nvme_rdma_destroy_io_queues(ctrl, shutdown);
>>  	}
>> -	if (shutdown)
>> +	if (shutdown) {
>> +		nvme_start_queues(&ctrl->ctrl);
> 
> any reason we unquiesce IO queues after destroying them and unquiesce 
> Admin queue before the destruction ?

The barrier is stop+cancel. But I think that if we remove the
controller completely we'd need to do it before blk_cleanup_queue
as it may block on blk_freeze_queue...

Let me try and have another look, and possibly send a patch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Suspected-Phishing]Re: [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer
  2018-06-24 16:00         ` Sagi Grimberg
@ 2018-06-24 16:19           ` Max Gurtovoy
  0 siblings, 0 replies; 25+ messages in thread
From: Max Gurtovoy @ 2018-06-24 16:19 UTC (permalink / raw)




On 6/24/2018 7:00 PM, Sagi Grimberg wrote:
> 
>> this fix still doesn't fix the issue that V1 from Israel did.
> 
> This fixes the issue that was reported (nvme_rdma_free_queue frees
> unallocated async_event_sqe).
> 
>> we still can call nvme_rdma_destroy_admin_queue few times (as I 
>> mentioned nvme_rdma_free_queue is safe by NVME_RDMA_Q_ALLOCATED bit 
>> but nvme_rdma_destroy_admin_queue isn't).
> 
> ugh... yes this is not enough to avoid the double free if
> reset/reconnect calls nvme_delete_ctrl.
> 
>> I added this check but I think this is not enough since things can run 
>> concurrently (I still need to make some testing and review):
> 
> The destroy routines are serialized by the work flush/cancel so it
> should be fine.

Yup, I agree.

> 
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index 5194568..e99ff80 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -168,8 +168,11 @@ static inline size_t nvme_rdma_inline_data_size(struct nvme_rdma_queue *queue)
>>  static void nvme_rdma_free_qe(struct ib_device *ibdev, struct nvme_rdma_qe *qe,
>>                 size_t capsule_size, enum dma_data_direction dir)
>>  {
>> -       ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
>> -       kfree(qe->data);
>> +       if (qe->data) {
>> +               ib_dma_unmap_single(ibdev, qe->dma, capsule_size, dir);
>> +               kfree(qe->data);
>> +               qe->data = NULL;
>> +       }
>>  }
> 
> I'd prefer to keep this check to the async event buffer:
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 9f5fc7b7fd7f..69c9b7bc0d5b 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -732,8 +732,11 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
>                 blk_cleanup_queue(ctrl->ctrl.admin_q);
>                 nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
>         }
> -       nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
> -               sizeof(struct nvme_command), DMA_TO_DEVICE);
> +       if (ctrl->async_event_sqe.data) {
> +               nvme_rdma_free_qe(ctrl->device->dev, &ctrl->async_event_sqe,
> +                       sizeof(struct nvme_command), DMA_TO_DEVICE);
> +               ctrl->async_event_sqe.data = NULL;
> +       }
>         nvme_rdma_free_queue(&ctrl->queues[0]);
>  }
> -- 
> 
> Does this make the issue go away Max?

Yes, the issue is with the async_event_sqe.
You can add
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
Tested-by: Max Gurtovoy <maxg at mellanox.com>

to the fix.

-Max.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-06-24 16:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-19 12:34 [PATCH 0/7] few nvme-rdma fixes for 4.18 Sagi Grimberg
2018-06-19 12:34 ` [PATCH 1/7] nvme-rdma: fix possible double free condition when failing to create a controller Sagi Grimberg
2018-06-20  9:06   ` Max Gurtovoy
2018-06-20 10:41     ` Sagi Grimberg
2018-06-20 12:29       ` Christoph Hellwig
2018-06-19 12:34 ` [PATCH 2/7] nvme-rdma: fix possible free non-allocated async event buffer Sagi Grimberg
2018-06-20  8:31   ` Christoph Hellwig
2018-06-20 12:02     ` Max Gurtovoy
2018-06-24 10:40       ` [Suspected-Phishing]Re: " Max Gurtovoy
2018-06-24 16:00         ` Sagi Grimberg
2018-06-24 16:19           ` Max Gurtovoy
2018-06-19 12:34 ` [PATCH 3/7] nvme-rdma: Fix command completion race at error recovery Sagi Grimberg
2018-06-19 12:34 ` [PATCH 4/7] nvme-rdma: unquiesce queues when deleting the controller Sagi Grimberg
2018-06-21  9:57   ` Max Gurtovoy
2018-06-24 16:07     ` Sagi Grimberg
2018-06-19 12:34 ` [PATCH 5/7] nvme-rdma: don't override opts->queue_size Sagi Grimberg
2018-06-19 16:56   ` Daniel Verkamp
2018-06-19 12:34 ` [PATCH 6/7] nvme-rdma: centralize controller setup sequence Sagi Grimberg
2018-06-19 12:34 ` [PATCH 7/7] nvme-rdma: centralize admin/io queue teardown sequence Sagi Grimberg
2018-06-20  8:40 ` [PATCH 0/7] few nvme-rdma fixes for 4.18 Christoph Hellwig
2018-06-20  8:40   ` Sagi Grimberg
2018-06-20  8:52     ` Christoph Hellwig
2018-06-20  8:53       ` Sagi Grimberg
2018-06-20  9:05         ` Christoph Hellwig
2018-06-20  9:08           ` Sagi Grimberg
