From mboxrd@z Thu Jan 1 00:00:00 1970
From: sagi@grimberg.me (Sagi Grimberg)
Date: Sun, 4 Sep 2016 12:17:34 +0300
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
In-Reply-To: <01c301d20485$0dfcd2c0$29f67840$@opengridcomputing.com>
References: <018301d1e9e1$da3b2e40$8eb18ac0$@opengridcomputing.com>
 <20160801110658.GF16141@lst.de>
 <008801d1ec00$a0bcfbf0$e236f3d0$@opengridcomputing.com>
 <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
 <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
 <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
 <018401d1f32b$792cfdb0$6b86f910$@opengridcomputing.com>
 <01a301d1f339$55ba8e70$012fab50$@opengridcomputing.com>
 <2fb1129c-424d-8b2d-7101-b9471e897dc8@grimberg.me>
 <004701d1f3d8$760660b0$62132210$@opengridcomputing.com>
 <008101d1f3de$557d2850$007778f0$@opengridcomputing.com>
 <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
 <01c301d1f702$d28c7270$77a55750$@opengridcomputing.com>
 <6ef9b0d1-ce84-4598-74db-7adeed313bb6@grimberg.me>
 <045601d1f803$a9d73a20$fd85ae60$@opengridcomputing.com>
 <69c0e819-76d9-286b-c4fb-22f087f36ff1@grimberg.me>
 <08b701d1f8ba$a709ae10$f51d0a30$@opengridcomputing.com>
 <01c301d20485$0dfcd2c0$29f67840$@opengridcomputing.com>
Message-ID: <0c159abb-24ee-21bf-09d2-9fe7d269a2eb@grimberg.me>

Hey Steve,

> Ok, back to this issue. :)
>
> The same crash happens with mlx4_ib, so this isn't related to cxgb4.
> To sum up:
>
> With pending NVME IO on the nvme-rdma host, and in the presence of kato
> recovery/reconnect due to the target going away, some NVME requests get
> restarted that are referencing nvmf controllers that have freed queues.
> I see this also with my recent v4 series that corrects the recovery
> problems with nvme-rdma when the target is down, but without pending IO.
>
> So the crash in this email is yet another issue that we see when the
> nvme host has lots of pending IO requests during kato recovery/reconnect...
>
> My findings to date: the IO is not an admin queue IO. It is not the kato
> messages. The io queue has been stopped, yet the request is attempted and
> causes the crash.
>
> Any help is appreciated...

So in the current state, my impression is that we are seeing a request
queued when we shouldn't (or at least assume we won't).

Given that you run a heavy load to reproduce this, I can only suspect
that this is a race condition.

Does this happen if you change the reconnect delay to something other
than 10 seconds (say 30)?

Can you also give patch [1] a try? It's not a solution, but I want to
see if it hides the problem...

Now, given that you already verified that the queues are stopped with
BLK_MQ_S_STOPPED, I'm looking at blk-mq now. I see that
blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() do take
BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue pairs
after these checks have passed but while the rq_list is still being
processed, we can end up in this condition; however, given that it takes
essentially forever (10 seconds), I tend to doubt this is the case.

HCH, Jens, Keith, any useful pointers for us? To summarize: we see a
stray request being queued long after we set BLK_MQ_S_STOPPED (and by
long I mean 10 seconds).
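To be concrete, the check I'm referring to looks roughly like this. This
is a simplified sketch of the blk-mq run path, paraphrased rather than
copied from the source, and run_hw_queue_sketch() is a made-up name just
for illustration:

	#include <linux/blk-mq.h>

	/*
	 * Sketch only: blk_mq_run_hw_queue() and __blk_mq_run_hw_queue()
	 * both start with an early-exit test on the hctx state.
	 */
	static void run_hw_queue_sketch(struct blk_mq_hw_ctx *hctx)
	{
		/* a stopped hctx must not dispatch anything */
		if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
			return;

		/*
		 * ... past this point the pending rq_list is spliced and
		 * drained, which eventually invokes the driver's
		 * ->queue_rq().  If the stop (and the queue-pair teardown)
		 * lands after the test above but while the list is still
		 * being drained, a request could in theory still reach a
		 * freed queue.
		 */
	}

But as said above, that window should be tiny, nowhere near the 10
seconds we are seeing here.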
[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d2f891efb27b..38ea5dab4524 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -701,20 +701,13 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	bool changed;
 	int ret;
 
-	if (ctrl->queue_count > 1) {
-		nvme_rdma_free_io_queues(ctrl);
-
-		ret = blk_mq_reinit_tagset(&ctrl->tag_set);
-		if (ret)
-			goto requeue;
-	}
-
-	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
 
 	ret = blk_mq_reinit_tagset(&ctrl->admin_tag_set);
 	if (ret)
 		goto requeue;
 
+	nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
+
 	ret = nvme_rdma_init_queue(ctrl, 0, NVMF_AQ_DEPTH);
 	if (ret)
 		goto requeue;
@@ -732,6 +725,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	nvme_start_keep_alive(&ctrl->ctrl);
 
 	if (ctrl->queue_count > 1) {
+		ret = blk_mq_reinit_tagset(&ctrl->tag_set);
+		if (ret)
+			goto stop_admin_q;
+
+		nvme_rdma_free_io_queues(ctrl);
+
 		ret = nvme_rdma_init_io_queues(ctrl);
 		if (ret)
 			goto stop_admin_q;
--
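If it helps to narrow this down further, an untested debug hack on top
of whatever tree you are running (just a sketch, not meant for merging)
would be to shout the moment a request actually reaches ->queue_rq() on
a stopped hctx, e.g. at the very top of nvme_rdma_queue_rq():

	/* debug sketch: catch a dispatch on a stopped hardware context */
	WARN_ONCE(test_bit(BLK_MQ_S_STOPPED, &hctx->state),
		  "nvme-rdma: ->queue_rq() on stopped hctx %d\n",
		  hctx->queue_num);

That would at least tell us whether blk-mq still considers the queue
stopped at the exact moment the stray request comes in.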