* [PATCH 0/2 v3] Fix nvme-rdma timeout flow
@ 2018-04-11 16:07 Israel Rukshin
  2018-04-11 16:07 ` [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery Israel Rukshin
  2018-04-11 16:07 ` [PATCH 2/2 v3] nvme-rdma: Fix command completion race at " Israel Rukshin
  0 siblings, 2 replies; 6+ messages in thread
From: Israel Rukshin @ 2018-04-11 16:07 UTC (permalink / raw)


Hi all,

This patch series fixes a bug that was reproduced by hitting a
block-mq I/O timeout (which causes nvmf to reset the controller) while
running with the RDMA transport.

The bug is a NULL dereference of a request's MR:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
 IP: __nvme_rdma_recv_done.isra.48+0x1ba/0x300 [nvme_rdma]
 Call Trace:
  <IRQ>
  nvme_rdma_recv_done+0x12/0x20 [nvme_rdma]
  __ib_process_cq+0x58/0xb0 [ib_core]
  ib_poll_handler+0x1d/0x70 [ib_core]
  irq_poll_softirq+0x98/0xf0
  __do_softirq+0xbc/0x1c0
  irq_exit+0x9a/0xb0
  do_IRQ+0x4c/0xd0
  common_interrupt+0x90/0x90
  </IRQ>


The bug happens because we complete the request before handling
its good RDMA completion.
When completing the request we return its MR to the MR pool
(and set the request's MR pointer to NULL) and also unmap its data.
This may also lead to memory corruption, as reported by VastData.
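
For illustration, here is a rough sketch of what completing a request tears
down (not the exact driver code; the helper and field names are written from
memory):

/*
 * Rough sketch, not the exact driver code: completing a request returns
 * its MR to the pool, clears req->mr and unmaps the data.  A late RDMA
 * completion that still touches req->mr then dereferences NULL, and the
 * device may still DMA into buffers that were just unmapped.
 */
static void sketch_complete_teardown(struct nvme_rdma_queue *queue,
                struct request *rq)
{
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

        ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
        req->mr = NULL;

        ib_dma_unmap_sg(queue->device->dev, req->sg_table.sgl, req->nents,
                        rq_data_dir(rq) == WRITE ?
                        DMA_TO_DEVICE : DMA_FROM_DEVICE);

        blk_mq_complete_request(rq);
}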

The two patches fix these problems by completing the requests only after
all the good completions have been handled and the QP is in the error state.

The current code completes requests from several places:
 - RDMA completions
 - block-mq timeout work
 - nvme abort commands (nvme_cancel_request())

The first commit no longer lets the block layer complete the request;
such requests will be completed by the nvme abort mechanism instead.
That leaves a race between only two places.

The second commit fixes the race between RDMA completions and
nvme abort commands by flushing all RDMA completions before
starting the abort mechanism (see the sketch below).
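
Schematically, the intended I/O queue teardown ordering after both patches
looks roughly like this (the names follow the driver but are written from
memory, and sketch_teardown_io_queues() is only an illustration):

/*
 * Rough sketch of the ordering this series establishes: stop/drain the
 * RDMA queues first so no further completions can run, and only then let
 * nvme_cancel_request() complete whatever is left.
 */
static void sketch_teardown_io_queues(struct nvme_rdma_ctrl *ctrl)
{
        nvme_stop_queues(&ctrl->ctrl);          /* quiesce block-mq queues */
        nvme_rdma_stop_io_queues(ctrl);         /* disconnect + drain QPs  */
        blk_mq_tagset_busy_iter(&ctrl->tag_set, /* safe to abort now       */
                                nvme_cancel_request, &ctrl->ctrl);
}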

Changes from v1:
 - Added cover letter

Changes from v2:
 - Edited bug description

Israel Rukshin (2):
  nvme-rdma: Fix race between queue timeout and error recovery
  nvme-rdma: Fix command completion race at error recovery

 drivers/nvme/host/rdma.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

-- 
1.8.3.1

* [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery
  2018-04-11 16:07 [PATCH 0/2 v3] Fix nvme-rdma timeout flow Israel Rukshin
@ 2018-04-11 16:07 ` Israel Rukshin
  2018-04-12 13:36   ` Christoph Hellwig
  2018-04-11 16:07 ` [PATCH 2/2 v3] nvme-rdma: Fix command completion race at " Israel Rukshin
  1 sibling, 1 reply; 6+ messages in thread
From: Israel Rukshin @ 2018-04-11 16:07 UTC (permalink / raw)


When nvme_rdma_timeout() returns BLK_EH_HANDLED, the block layer
completes the request.
Returning BLK_EH_RESET_TIMER instead is safe because those requests will be
completed later by the nvme abort mechanism.

Completing requests in the timeout handler was done while
the RDMA queues were still active.
When completing a request we return its MR to the MR pool (setting the MR
pointer to NULL) and also unmap its data.
This leads to a NULL dereference of the MR if we get an RDMA completion for
an already completed request.
It also unmaps the request's data before it is really safe to do so.

Signed-off-by: Israel Rukshin <israelr at mellanox.com>
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/host/rdma.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 758537e..c1abfc8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1595,10 +1595,7 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	/* queue error recovery */
 	nvme_rdma_error_recovery(req->queue->ctrl);
 
-	/* fail with DNR on cmd timeout */
-	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
-
-	return BLK_EH_HANDLED;
+	return BLK_EH_RESET_TIMER;
 }
 
 /*
-- 
1.8.3.1

* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-11 16:07 [PATCH 0/2 v3] Fix nvme-rdma timeout flow Israel Rukshin
  2018-04-11 16:07 ` [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery Israel Rukshin
@ 2018-04-11 16:07 ` Israel Rukshin
  2018-04-12 13:37   ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Israel Rukshin @ 2018-04-11 16:07 UTC (permalink / raw)


The race is between completing a request in the error recovery work
and its RDMA completion.
If we cancel the request before its good RDMA completion arrives, we get a
NULL dereference of the request's MR in nvme_rdma_process_nvme_rsp(),
roughly as sketched below.
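
The crashing check looks roughly like this (a sketch written from memory,
not the exact driver code; sketch_check_inline_invalidate() is only an
illustration wrapping the relevant fragment):

static void sketch_check_inline_invalidate(struct nvme_rdma_queue *queue,
                struct nvme_rdma_request *req, struct ib_wc *wc)
{
        /*
         * Sketch from memory of the response-processing path: a completion
         * carrying a remote invalidation compares the invalidated rkey
         * against req->mr->rkey.  If the request was already canceled,
         * req->mr is NULL and this oopses.
         */
        if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
                if (unlikely(wc->ex.invalidate_rkey != req->mr->rkey))
                        nvme_rdma_error_recovery(queue->ctrl);
        }
}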

When canceling a request we return its MR to the MR pool (setting the MR
pointer to NULL) and also unmap its data.
Canceling requests while the RDMA queues are still active is therefore not
safe: good RDMA completions can still arrive and use the MR pointer, which
may already be NULL.
Completing the request too soon may also lead to DMA to/from user buffers
that have already been unmapped.

This commit fixes the race by draining the QP before
starting the abort mechanism, along the lines of the sketch below.
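
What "draining the QP" means here, sketched from memory (the driver's
actual stop-queue helper may look slightly different):

/*
 * Sketch from memory; not the exact helper.  ib_drain_qp() moves the QP
 * into the error state and waits until every outstanding send/recv has
 * been flushed through the CQ, so once it returns no RDMA completion can
 * race with nvme_cancel_request().
 */
static void sketch_stop_queue(struct nvme_rdma_queue *queue)
{
        rdma_disconnect(queue->cm_id);
        ib_drain_qp(queue->qp);
}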

Signed-off-by: Israel Rukshin <israelr at mellanox.com>
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/host/rdma.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c1abfc8..56fccac 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -734,7 +734,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.admin_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
@@ -817,7 +816,6 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 static void nvme_rdma_destroy_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_io_queues(ctrl);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
@@ -947,6 +945,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	return;
 
 destroy_admin:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
@@ -963,12 +962,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, false);
 	}
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
@@ -1725,6 +1726,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, shutdown);
@@ -1736,6 +1738,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
@@ -2001,6 +2004,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	return &ctrl->ctrl;
 
 out_remove_admin_queue:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, true);
 out_kfree_queues:
 	kfree(ctrl->queues);
-- 
1.8.3.1

* [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery
  2018-04-11 16:07 ` [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery Israel Rukshin
@ 2018-04-12 13:36   ` Christoph Hellwig
  0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2018-04-12 13:36 UTC (permalink / raw)


Let's wait for completion of the block error handling discussion on
this one..

* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-11 16:07 ` [PATCH 2/2 v3] nvme-rdma: Fix command completion race at " Israel Rukshin
@ 2018-04-12 13:37   ` Christoph Hellwig
  2018-04-15 13:50     ` Max Gurtovoy
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2018-04-12 13:37 UTC (permalink / raw)


We really need to keep this code in sync with the various transports,
chances are any race you see in RDMA will happen elsewhere as well.

* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-12 13:37   ` Christoph Hellwig
@ 2018-04-15 13:50     ` Max Gurtovoy
  0 siblings, 0 replies; 6+ messages in thread
From: Max Gurtovoy @ 2018-04-15 13:50 UTC (permalink / raw)




On 4/12/2018 4:37 PM, Christoph Hellwig wrote:
> We really need to keep this code in sync with the various transports,
> chances are any race you see in RDMA will happen elsewhere as well.
> 

I saw that the PCI driver calls nvme_pci_disable(), so it will not get any
interrupts/completions from the drive and the nvme_cancel_request() loop is
safe (a rough sketch of that helper is below).
Regarding FC we need James here.
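
For reference, nvme_cancel_request() in the core is roughly the following
(quoted from memory, may differ slightly by kernel version):

/*
 * Quoted from memory, may differ by kernel version: the cancel callback
 * just marks the command aborted and completes it, so it is only safe
 * once the transport can no longer deliver a completion for the same
 * request.
 */
void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
        nvme_req(req)->status = NVME_SC_ABORT_REQ;
        blk_mq_complete_request(req);
}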

If you're talking about pushing it to common code in the core layer, 
then I guess it's a good idea with or without this patchset (a hypothetical 
sketch follows).
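
Hypothetical illustration only, not existing core code: a shared teardown
helper might look something like this, with the transport-specific
"stop queues" step passed in as a callback so every transport gets the same
ordering.

static void sketch_nvme_teardown_io_queues(struct nvme_ctrl *ctrl,
                void (*stop_transport_queues)(struct nvme_ctrl *ctrl))
{
        nvme_stop_queues(ctrl);                 /* quiesce block-mq queues */
        stop_transport_queues(ctrl);            /* transport-specific stop */
        blk_mq_tagset_busy_iter(ctrl->tagset,   /* then abort what's left  */
                                nvme_cancel_request, ctrl);
}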
