* [PATCH 0/2 v3] Fix nvme-rdma timeout flow
@ 2018-04-11 16:07 Israel Rukshin
  2018-04-11 16:07 ` [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery Israel Rukshin
  2018-04-11 16:07 ` [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery Israel Rukshin

From: Israel Rukshin @ 2018-04-11 16:07 UTC

Hi all,

This patch series fixes a bug that was reproduced while hitting a blk-mq
I/O timeout (which causes nvmf to reset the controller) when running with
the RDMA transport. The bug is a NULL dereference of a request MR:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
IP: __nvme_rdma_recv_done.isra.48+0x1ba/0x300 [nvme_rdma]
Call Trace:
 <IRQ>
 nvme_rdma_recv_done+0x12/0x20 [nvme_rdma]
 __ib_process_cq+0x58/0xb0 [ib_core]
 ib_poll_handler+0x1d/0x70 [ib_core]
 irq_poll_softirq+0x98/0xf0
 __do_softirq+0xbc/0x1c0
 irq_exit+0x9a/0xb0
 do_IRQ+0x4c/0xd0
 common_interrupt+0x90/0x90
 </IRQ>

The bug happens because we complete the request before handling the good
RDMA completion. When completing the request we return its MR to the MR
pool (and set the request's MR pointer to NULL) and also unmap its data.
This may also lead to memory corruption, as was reported by VastData.

My two patches fix these problems by completing the requests only after
we finish handling all the good completions and the QP is in the error
state. The current code completes the requests from several places:
 - RDMA completions
 - the blk-mq timeout work
 - nvme abort commands (nvme_cancel_request())

The first commit doesn't let the block layer complete the request; those
requests will be completed by the nvme abort mechanism instead. So now we
have a race between only two places. The second commit fixes the race
between RDMA completions and nvme abort commands by flushing all the RDMA
completions before starting the abort mechanism.

Changes from v1:
 - Added a cover letter
Changes from v2:
 - Edited the bug description

Israel Rukshin (2):
  nvme-rdma: Fix race between queue timeout and error recovery
  nvme-rdma: Fix command completion race at error recovery

 drivers/nvme/host/rdma.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--
1.8.3.1
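To make the failure mode in the cover letter concrete, the ordering can be modelled in a few lines of user-space C. This is a sketch only: the struct and function names are hypothetical stand-ins for the kernel's request/MR handling, not the actual nvme-rdma code. Completing a request returns its MR to the pool and clears the pointer, so a late completion that still touches `req->mr` finds NULL; handling the completion before completing the request avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the race described above. */
struct mr { int rkey; };
struct request { struct mr *mr; };

/* Completing the request returns the MR to the pool (modelled here
 * simply by clearing the pointer) and would also unmap the data. */
static void complete_request(struct request *req)
{
	req->mr = NULL;
}

/* A "good" RDMA completion dereferences the MR.  Returns -1 where the
 * kernel would oops on the NULL pointer, the rkey otherwise. */
static int recv_done(struct request *req)
{
	if (req->mr == NULL)
		return -1;		/* the NULL deref in the report */
	return req->mr->rkey;		/* normal completion path */
}

/* Buggy order: the request is completed first, then a completion
 * races in and touches the freed MR. */
static int buggy_order(void)
{
	struct mr mr = { 42 };
	struct request req = { &mr };

	complete_request(&req);
	return recv_done(&req);
}

/* Fixed order: all good completions are handled while the MR is still
 * valid; only then is the request completed. */
static int fixed_order(void)
{
	struct mr mr = { 42 };
	struct request req = { &mr };

	int rkey = recv_done(&req);
	complete_request(&req);
	return rkey;
}
```

The model collapses the concurrency into two sequential orderings, which is exactly the point of the series: force the safe ordering by draining completions before any request is completed.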
* [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery
  2018-04-11 16:07 [PATCH 0/2 v3] Fix nvme-rdma timeout flow Israel Rukshin
@ 2018-04-11 16:07 ` Israel Rukshin
  2018-04-12 13:36   ` Christoph Hellwig

From: Israel Rukshin @ 2018-04-11 16:07 UTC

When nvme_rdma_timeout() returns BLK_EH_HANDLED, the block layer
completes the request. Returning BLK_EH_RESET_TIMER is safe because
those requests will be completed later by the nvme abort mechanism.

Completing the requests in the timeout handler was done while the RDMA
queues were still active. When completing the request we return its MR
to the MR pool (setting the MR pointer to NULL) and also unmap its data.
This leads to a NULL dereference of the MR if we get an RDMA completion
for an already-completed request. It also unmaps the request data before
it is really safe to do so.

Signed-off-by: Israel Rukshin <israelr at mellanox.com>
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/host/rdma.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 758537e..c1abfc8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1595,10 +1595,7 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
 	/* queue error recovery */
 	nvme_rdma_error_recovery(req->queue->ctrl);
 
-	/* fail with DNR on cmd timeout */
-	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
-
-	return BLK_EH_HANDLED;
+	return BLK_EH_RESET_TIMER;
 }
 
 /*
--
1.8.3.1
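The return-value semantics this patch relies on can be sketched in user-space C (a hypothetical model, not the block layer itself): with BLK_EH_HANDLED the block layer completes the request immediately, while the RDMA queue is still live, and the abort mechanism may complete it again later; with BLK_EH_RESET_TIMER the request stays alive until the abort mechanism completes it exactly once.

```c
#include <assert.h>

/* Hypothetical model of who completes a timed-out request.
 * The enum mirrors the blk-mq timeout return values. */
enum blk_eh { BLK_EH_HANDLED, BLK_EH_RESET_TIMER };

struct req { int completions; };

static void complete_once(struct req *r)
{
	r->completions++;
}

/* Returns how many times the request ends up completed for a given
 * timeout-handler return value. */
static int timeout_flow(enum blk_eh timeout_ret, struct req *r)
{
	if (timeout_ret == BLK_EH_HANDLED)
		complete_once(r);	/* block layer completes immediately,
					 * queues still active */
	complete_once(r);		/* error recovery / nvme abort
					 * mechanism completes it later */
	return r->completions;		/* >1 models the unsafe overlap */
}
```

In the model, BLK_EH_RESET_TIMER yields a single completion from the abort path only, which is the "safe" ordering the commit message describes.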
* [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery
  2018-04-11 16:07 ` [PATCH 1/2 v3] nvme-rdma: Fix race between queue timeout and error recovery Israel Rukshin
@ 2018-04-12 13:36   ` Christoph Hellwig

From: Christoph Hellwig @ 2018-04-12 13:36 UTC

Let's wait for completion of the block error handling discussion on this
one..
* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-11 16:07 [PATCH 0/2 v3] Fix nvme-rdma timeout flow Israel Rukshin
@ 2018-04-11 16:07 ` Israel Rukshin
  2018-04-12 13:37   ` Christoph Hellwig

From: Israel Rukshin @ 2018-04-11 16:07 UTC

The race is between completing the request in the error recovery work
and the RDMA completions. If we cancel the request before getting the
good RDMA completion, we get a NULL dereference of the request MR at
nvme_rdma_process_nvme_rsp(). When canceling the request we return its
MR to the MR pool (setting the MR pointer to NULL) and also unmap its
data.

Canceling the requests while the RDMA queues are active is not safe:
while the queues are active, good RDMA completions may still arrive and
use the MR pointer, which may already be NULL. Completing the request
too soon may also lead to DMA to/from user buffers that have already
been unmapped.

The commit fixes the race by draining the QP before starting the abort
commands mechanism.

Signed-off-by: Israel Rukshin <israelr at mellanox.com>
Reviewed-by: Max Gurtovoy <maxg at mellanox.com>
---
 drivers/nvme/host/rdma.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c1abfc8..56fccac 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -734,7 +734,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.admin_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.admin_tagset);
@@ -817,7 +816,6 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 static void nvme_rdma_destroy_io_queues(struct nvme_rdma_ctrl *ctrl,
 		bool remove)
 {
-	nvme_rdma_stop_io_queues(ctrl);
 	if (remove) {
 		blk_cleanup_queue(ctrl->ctrl.connect_q);
 		nvme_rdma_free_tagset(&ctrl->ctrl, ctrl->ctrl.tagset);
@@ -947,6 +945,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	return;
 
 destroy_admin:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
@@ -963,12 +962,14 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, false);
 	}
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	nvme_rdma_destroy_admin_queue(ctrl, false);
@@ -1725,6 +1726,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 {
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
+		nvme_rdma_stop_io_queues(ctrl);
 		blk_mq_tagset_busy_iter(&ctrl->tag_set,
 					nvme_cancel_request, &ctrl->ctrl);
 		nvme_rdma_destroy_io_queues(ctrl, shutdown);
@@ -1736,6 +1738,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
 		nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap);
 
 	blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
 				nvme_cancel_request, &ctrl->ctrl);
 	blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
@@ -2001,6 +2004,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	return &ctrl->ctrl;
 
 out_remove_admin_queue:
+	nvme_rdma_stop_queue(&ctrl->queues[0]);
 	nvme_rdma_destroy_admin_queue(ctrl, true);
 out_kfree_queues:
 	kfree(ctrl->queues);
--
1.8.3.1
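The net effect of this patch is an ordering rule: stop (drain) the RDMA queue before running the cancel loop, so no completion can still fire against a cancelled request. A user-space sketch (hypothetical names, not kernel code) of why the ordering matters:

```c
#include <assert.h>

/* Hypothetical model: a queue with completions still in flight. */
struct queue {
	int pending_completions;	/* CQEs not yet processed */
	int stopped;			/* no new completions possible */
};

/* Models nvme_rdma_stop_queue(): put the QP in error state and drain,
 * so every outstanding completion is flushed before we return. */
static void stop_queue(struct queue *q)
{
	q->pending_completions = 0;
	q->stopped = 1;
}

/* Models the nvme_cancel_request loop: returns 1 if it runs safely,
 * i.e. no completion can race with the cancellation. */
static int cancel_requests(struct queue *q)
{
	return q->stopped && q->pending_completions == 0;
}

/* Patched order: drain first, then cancel. */
static int error_recovery(struct queue *q)
{
	stop_queue(q);
	return cancel_requests(q);
}

/* Pre-patch order: cancel while completions are still pending. */
static int old_error_recovery(struct queue *q)
{
	int safe = cancel_requests(q);
	stop_queue(q);			/* drain happens too late */
	return safe;
}
```

In the model, only the drain-then-cancel order reports a safe cancellation, mirroring why the stop calls move ahead of blk_mq_tagset_busy_iter() in the diff above.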
* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-11 16:07 ` [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery Israel Rukshin
@ 2018-04-12 13:37   ` Christoph Hellwig
  2018-04-15 13:50     ` Max Gurtovoy

From: Christoph Hellwig @ 2018-04-12 13:37 UTC

We really need to keep this code in sync with the various transports;
chances are any race you see in RDMA will happen elsewhere as well.
* [PATCH 2/2 v3] nvme-rdma: Fix command completion race at error recovery
  2018-04-12 13:37 ` Christoph Hellwig
@ 2018-04-15 13:50   ` Max Gurtovoy

From: Max Gurtovoy @ 2018-04-15 13:50 UTC

On 4/12/2018 4:37 PM, Christoph Hellwig wrote:
> We really need to keep this code in sync with the various transports,
> chances are any race you see in RDMA will happen elsewhere as well.

I saw that the PCI driver calls nvme_pci_disable, so it will not get any
interrupts/completions from the drive; the nvme_cancel_request loop is
therefore safe. Regarding FC, we need James here.

If you're talking about pushing this into common code in the core layer,
then I guess it's a good idea with or without this patchset.